
[v2,1/2] fuse: Fix possible deadlock when writing back dirty pages

Message ID 807bb470f90bae5dcd80a29020d38f6b5dd6ef8e.1616826872.git.baolin.wang@linux.alibaba.com (mailing list archive)
State New, archived
Series [v2,1/2] fuse: Fix possible deadlock when writing back dirty pages

Commit Message

Baolin Wang March 27, 2021, 6:36 a.m. UTC
We can hit the deadlock scenario below when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A tries to write a bunch of dirty pages via
fuse_writepages(); fuse_writepages() locks one page (call it page 1),
adds it to the rb_tree with the writeback flag set, unlocks page 1, and
then tries to lock the next page (call it page 2).

- At the same time, a file write can be triggered by another process B,
which writes several pages via fuse_perform_write(); fuse_perform_write()
first locks all the required pages and then waits for any writeback on
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and
waits for page 1's writeback to complete (page 1 is under writeback, set
by process A). But process A cannot complete the writeback of page 1,
since it is still waiting to lock page 2, which is already locked by
process B.

A deadlock occurs.

To fix this issue, make sure each page's writeback has completed right
after locking that page in fuse_fill_write_pages(), and then write the
pages out together once they are all stable.
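
To illustrate the intended ordering, below is a much-simplified sketch of
fuse_fill_write_pages() with this change applied (error handling, the
fault-in/retry logic and the big_writes/max_write limits are omitted; it
is not the literal kernel code, just the shape of the loop):

static ssize_t fuse_fill_write_pages_sketch(struct fuse_args_pages *ap,
					    struct address_space *mapping,
					    struct iov_iter *ii, loff_t pos,
					    unsigned int max_pages)
{
	struct inode *inode = mapping->host;
	unsigned offset = pos & (PAGE_SIZE - 1);
	size_t count = 0;

	do {
		pgoff_t index = pos >> PAGE_SHIFT;
		size_t bytes = min_t(size_t, PAGE_SIZE - offset,
				     iov_iter_count(ii));
		struct page *page;
		size_t tmp;

		page = grab_cache_page_write_begin(mapping, index, 0);
		if (!page)
			break;

		/*
		 * Wait for this page's writeback while only this single
		 * page is locked, so the writer never holds several page
		 * locks at once while waiting for fuse_writepages() to
		 * finish.
		 */
		fuse_wait_on_page_writeback(inode, index);

		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
		flush_dcache_page(page);
		iov_iter_advance(ii, tmp);

		ap->pages[ap->num_pages] = page;
		ap->descs[ap->num_pages].length = tmp;
		ap->num_pages++;

		count += tmp;
		pos += tmp;
		offset = (offset + tmp) & (PAGE_SIZE - 1);
	} while (iov_iter_count(ii) && ap->num_pages < max_pages &&
		 offset == 0);

	return count;
}

fuse_send_write_pages() then no longer needs its own wait loop, since every
page in ap->pages[] is already stable by the time it is called.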

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 seconds.
[1450578.796179] kworker/u259:6  D    0 119885      2 0x00000028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doio            D    0 207752 207740 0x00000000
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from v1:
 - Use fuse_wait_on_page_writeback() instead to wait for the page to become stable.
---
 fs/fuse/file.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Baolin Wang April 12, 2021, 1:23 p.m. UTC | #1
Hi Miklos,

On 2021/3/27 14:36, Baolin Wang wrote:
> We can hit the deadlock scenario below when writing back dirty pages and
> writing files at the same time. The deadlock scenario can be reproduced
> by:
>
> - A writeback worker thread A tries to write a bunch of dirty pages via
> fuse_writepages(); fuse_writepages() locks one page (call it page 1),
> adds it to the rb_tree with the writeback flag set, unlocks page 1, and
> then tries to lock the next page (call it page 2).
>
> - At the same time, a file write can be triggered by another process B,
> which writes several pages via fuse_perform_write(); fuse_perform_write()
> first locks all the required pages and then waits for any writeback on
> them to complete via fuse_wait_on_page_writeback().
>
> - Now process B may already hold the locks on page 1 and page 2, and
> waits for page 1's writeback to complete (page 1 is under writeback, set
> by process A). But process A cannot complete the writeback of page 1,
> since it is still waiting to lock page 2, which is already locked by
> process B.
>
> A deadlock occurs.
>
> To fix this issue, make sure each page's writeback has completed right
> after locking that page in fuse_fill_write_pages(), and then write the
> pages out together once they are all stable.
> 
> Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Do you have any comments for this patch set? Thanks.

> ---
> Changes from v1:
>   - Use fuse_wait_on_page_writeback() instead to wait for the page to become stable.
> ---
>   fs/fuse/file.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8cccecb..9a30093 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1101,9 +1101,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>   	unsigned int offset, i;
>   	int err;
>   
> -	for (i = 0; i < ap->num_pages; i++)
> -		fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
> -
>   	fuse_write_args_fill(ia, ff, pos, count);
>   	ia->write.in.flags = fuse_write_flags(iocb);
>   	if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
> @@ -1140,6 +1137,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_args_pages *ap,
>   				     unsigned int max_pages)
>   {
>   	struct fuse_conn *fc = get_fuse_conn(mapping->host);
> +	struct inode *inode = mapping->host;
>   	unsigned offset = pos & (PAGE_SIZE - 1);
>   	size_t count = 0;
>   	int err;
> @@ -1166,6 +1164,8 @@ static ssize_t fuse_fill_write_pages(struct fuse_args_pages *ap,
>   		if (!page)
>   			break;
>   
> +		fuse_wait_on_page_writeback(inode, page->index);
> +
>   		if (mapping_writably_mapped(mapping))
>   			flush_dcache_page(page);
>   
>
Miklos Szeredi April 13, 2021, 8:57 a.m. UTC | #2
On Mon, Apr 12, 2021 at 3:23 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Hi Miklos,
>
> On 2021/3/27 14:36, Baolin Wang wrote:
> > We can hit the deadlock scenario below when writing back dirty pages and
> > writing files at the same time. The deadlock scenario can be reproduced
> > by:
> >
> > - A writeback worker thread A tries to write a bunch of dirty pages via
> > fuse_writepages(); fuse_writepages() locks one page (call it page 1),
> > adds it to the rb_tree with the writeback flag set, unlocks page 1, and
> > then tries to lock the next page (call it page 2).
> >
> > - At the same time, a file write can be triggered by another process B,
> > which writes several pages via fuse_perform_write(); fuse_perform_write()
> > first locks all the required pages and then waits for any writeback on
> > them to complete via fuse_wait_on_page_writeback().
> >
> > - Now process B may already hold the locks on page 1 and page 2, and
> > waits for page 1's writeback to complete (page 1 is under writeback, set
> > by process A). But process A cannot complete the writeback of page 1,
> > since it is still waiting to lock page 2, which is already locked by
> > process B.
> >
> > A deadlock occurs.
> >
> > To fix this issue, make sure each page's writeback has completed right
> > after locking that page in fuse_fill_write_pages(), and then write the
> > pages out together once they are all stable.
> >
> > Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Do you have any comments for this patch set? Thanks.

Hi,

I guess this is related:

https://lore.kernel.org/linux-fsdevel/20210209100115.GB1208880@miu.piliscsaba.redhat.com/

Can you verify that the patch at the above link fixes your issue?

Thanks,
Miklos
Baolin Wang April 14, 2021, 8:42 a.m. UTC | #3
Hi,

On 2021/4/13 16:57, Miklos Szeredi wrote:
> On Mon, Apr 12, 2021 at 3:23 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Hi Miklos,
>>
>> On 2021/3/27 14:36, Baolin Wang wrote:
>>> We can hit the deadlock scenario below when writing back dirty pages and
>>> writing files at the same time. The deadlock scenario can be reproduced
>>> by:
>>>
>>> - A writeback worker thread A tries to write a bunch of dirty pages via
>>> fuse_writepages(); fuse_writepages() locks one page (call it page 1),
>>> adds it to the rb_tree with the writeback flag set, unlocks page 1, and
>>> then tries to lock the next page (call it page 2).
>>>
>>> - At the same time, a file write can be triggered by another process B,
>>> which writes several pages via fuse_perform_write(); fuse_perform_write()
>>> first locks all the required pages and then waits for any writeback on
>>> them to complete via fuse_wait_on_page_writeback().
>>>
>>> - Now process B may already hold the locks on page 1 and page 2, and
>>> waits for page 1's writeback to complete (page 1 is under writeback, set
>>> by process A). But process A cannot complete the writeback of page 1,
>>> since it is still waiting to lock page 2, which is already locked by
>>> process B.
>>>
>>> A deadlock occurs.
>>>
>>> To fix this issue, make sure each page's writeback has completed right
>>> after locking that page in fuse_fill_write_pages(), and then write the
>>> pages out together once they are all stable.
>>>
>>> Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> Do you have any comments for this patch set? Thanks.
> 
> Hi,
> 
> I guess this is related:
> 
> https://lore.kernel.org/linux-fsdevel/20210209100115.GB1208880@miu.piliscsaba.redhat.com/
> 
> Can you verify that the patch at the above link fixes your issue?

Sorry, I missed this patch before. I've tested it, and it seems it can
solve the deadlock issue I met before.

But looking at this patch in detail, I think it only reduces the deadlock
window; it does not completely remove the possible deadlock scenario I
explained in the commit log.

With your patch, fuse_fill_write_pages() can still lock the partial page,
and fuse_send_write_pages() will then wait for the partial page's
writeback to complete if the writeback flag is set.

But at the same time, a writeback worker thread may be waiting to lock
that partial page in order to write a bunch of dirty pages via
fuse_writepages().

So the deadlock can still occur. And I think the deadlock I met is not
the same as the one solved by your patch.
Miklos Szeredi April 14, 2021, 9:02 a.m. UTC | #4
On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:

> Sorry, I missed this patch before. I've tested it, and it seems it can
> solve the deadlock issue I met before.

Great, thanks for testing.

> But looking at this patch in detail, I think it only reduces the deadlock
> window; it does not completely remove the possible deadlock scenario I
> explained in the commit log.
>
> With your patch, fuse_fill_write_pages() can still lock the partial page,
> and fuse_send_write_pages() will then wait for the partial page's
> writeback to complete if the writeback flag is set.
>
> But at the same time, a writeback worker thread may be waiting to lock
> that partial page in order to write a bunch of dirty pages via
> fuse_writepages().

As you say, fuse_fill_write_pages() will lock a partial page.  This
page cannot become dirty until it has been read completely, which
first requires the page lock.  So dirtying this page can only happen
after the writeback of the fragment was completed.

I don't see how this could lead to a deadlock.

Thanks,
Miklos
Baolin Wang April 14, 2021, 9:22 a.m. UTC | #5
On 2021/4/14 17:02, Miklos Szeredi wrote:
> On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> 
>> Sorry, I missed this patch before. I've tested it, and it seems it can
>> solve the deadlock issue I met before.
> 
> Great, thanks for testing.
> 
>> But looking at this patch in detail, I think it only reduces the deadlock
>> window; it does not completely remove the possible deadlock scenario I
>> explained in the commit log.
>>
>> With your patch, fuse_fill_write_pages() can still lock the partial page,
>> and fuse_send_write_pages() will then wait for the partial page's
>> writeback to complete if the writeback flag is set.
>>
>> But at the same time, a writeback worker thread may be waiting to lock
>> that partial page in order to write a bunch of dirty pages via
>> fuse_writepages().
> 
> As you say, fuse_fill_write_pages() will lock a partial page.  This
> page cannot become dirty until it has been read completely, which
> first requires the page lock.  So dirtying this page can only happen
> after the writeback of the fragment was completed.

What I mean is that the writeback worker has already looked up the dirty
pages in write_cache_pages() and stored them in a temporary pagevec, and
then tries to lock the dirty pages one by one and write them out.

For example, suppose it looked up 2 dirty pages (call them page 1 and
page 2), wrote out page 1 via fuse_writepages_fill(), and unlocked page
1. Then it tries to lock page 2.

At the same time, suppose fuse_fill_write_pages() is writing the same
page 1 and a partial page 2; it will lock the partial page 2 and wait
for page 1's writeback to complete. But page 1's writeback cannot
complete, since the writeback worker is waiting to lock page 2, which
was already locked by fuse_fill_write_pages().

Does that make sense to you? Or did I miss something?
Miklos Szeredi April 14, 2021, 9:47 a.m. UTC | #6
On Wed, Apr 14, 2021 at 11:22 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2021/4/14 17:02, Miklos Szeredi wrote:
> > On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >
> >> Sorry, I missed this patch before. I've tested it, and it seems it can
> >> solve the deadlock issue I met before.
> >
> > Great, thanks for testing.
> >
> >> But looking at this patch in detail, I think it only reduces the deadlock
> >> window; it does not completely remove the possible deadlock scenario I
> >> explained in the commit log.
> >>
> >> With your patch, fuse_fill_write_pages() can still lock the partial page,
> >> and fuse_send_write_pages() will then wait for the partial page's
> >> writeback to complete if the writeback flag is set.
> >>
> >> But at the same time, a writeback worker thread may be waiting to lock
> >> that partial page in order to write a bunch of dirty pages via
> >> fuse_writepages().
> >
> > As you say, fuse_fill_write_pages() will lock a partial page.  This
> > page cannot become dirty until it has been read completely, which
> > first requires the page lock.  So dirtying this page can only happen
> > after the writeback of the fragment was completed.
>
> What I mean is that the writeback worker has already looked up the dirty
> pages in write_cache_pages() and stored them in a temporary pagevec, and
> then tries to lock the dirty pages one by one and write them out.
>
> For example, suppose it looked up 2 dirty pages (call them page 1 and
> page 2), wrote out page 1 via fuse_writepages_fill(), and unlocked page
> 1. Then it tries to lock page 2.
>
> At the same time, suppose fuse_fill_write_pages() is writing the same
> page 1 and a partial page 2; it will lock the partial page 2 and wait
> for page 1's writeback to complete. But page 1's writeback cannot
> complete, since the writeback worker is waiting to lock page 2, which
> was already locked by fuse_fill_write_pages().

How would page2 become not uptodate, when it was already collected by
write_cache_pages()?  I.e. page2 is a dirty page, hence it must be
uptodate, and fuse_writepages_fill() will not keep it locked.

Your patch may make sense regardless, but it needs to have a clear
analysis about why the  fuse_wait_on_page_writeback() was needed in
the first place (it's not clear from the history) or why it's okay to
move it.

Thanks,
Miklos
Baolin Wang April 14, 2021, 10:18 a.m. UTC | #7
On 2021/4/14 17:47, Miklos Szeredi wrote:
> On Wed, Apr 14, 2021 at 11:22 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2021/4/14 17:02, Miklos Szeredi wrote:
>>> On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>> Sorry, I missed this patch before. I've tested it, and it seems it can
>>>> solve the deadlock issue I met before.
>>>
>>> Great, thanks for testing.
>>>
>>>> But looking at this patch in detail, I think it only reduces the deadlock
>>>> window; it does not completely remove the possible deadlock scenario I
>>>> explained in the commit log.
>>>>
>>>> With your patch, fuse_fill_write_pages() can still lock the partial page,
>>>> and fuse_send_write_pages() will then wait for the partial page's
>>>> writeback to complete if the writeback flag is set.
>>>>
>>>> But at the same time, a writeback worker thread may be waiting to lock
>>>> that partial page in order to write a bunch of dirty pages via
>>>> fuse_writepages().
>>>
>>> As you say, fuse_fill_write_pages() will lock a partial page.  This
>>> page cannot become dirty until it has been read completely, which
>>> first requires the page lock.  So dirtying this page can only happen
>>> after the writeback of the fragment was completed.
>>
>> What I mean is that the writeback worker has already looked up the dirty
>> pages in write_cache_pages() and stored them in a temporary pagevec, and
>> then tries to lock the dirty pages one by one and write them out.
>>
>> For example, suppose it looked up 2 dirty pages (call them page 1 and
>> page 2), wrote out page 1 via fuse_writepages_fill(), and unlocked page
>> 1. Then it tries to lock page 2.
>>
>> At the same time, suppose fuse_fill_write_pages() is writing the same
>> page 1 and a partial page 2; it will lock the partial page 2 and wait
>> for page 1's writeback to complete. But page 1's writeback cannot
>> complete, since the writeback worker is waiting to lock page 2, which
>> was already locked by fuse_fill_write_pages().
> 
> How would page2 become not uptodate, when it was already collected by
> write_cache_pages()?  I.e. page2 is a dirty page, hence it must be
> uptodate, and fuse_writepages_fill() will not keep it locked.

Reading your patch carefully again, I now realize you are right, and your
patch can solve the deadlock issue I met. Please feel free to add my
Tested-by tag to your patch. Thanks.

Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Peng Tao April 14, 2021, 12:22 p.m. UTC | #8
On Tue, Apr 13, 2021 at 5:42 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Apr 12, 2021 at 3:23 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Hi Miklos,
> >
> > On 2021/3/27 14:36, Baolin Wang wrote:
> > > We can hit the deadlock scenario below when writing back dirty pages and
> > > writing files at the same time. The deadlock scenario can be reproduced
> > > by:
> > >
> > > - A writeback worker thread A tries to write a bunch of dirty pages via
> > > fuse_writepages(); fuse_writepages() locks one page (call it page 1),
> > > adds it to the rb_tree with the writeback flag set, unlocks page 1, and
> > > then tries to lock the next page (call it page 2).
> > >
> > > - At the same time, a file write can be triggered by another process B,
> > > which writes several pages via fuse_perform_write(); fuse_perform_write()
> > > first locks all the required pages and then waits for any writeback on
> > > them to complete via fuse_wait_on_page_writeback().
> > >
> > > - Now process B may already hold the locks on page 1 and page 2, and
> > > waits for page 1's writeback to complete (page 1 is under writeback, set
> > > by process A). But process A cannot complete the writeback of page 1,
> > > since it is still waiting to lock page 2, which is already locked by
> > > process B.
> > >
> > > A deadlock occurs.
> > >
> > > To fix this issue, make sure each page's writeback has completed right
> > > after locking that page in fuse_fill_write_pages(), and then write the
> > > pages out together once they are all stable.
> > >
> > > Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > Do you have any comments for this patch set? Thanks.
>
> Hi,
>
> I guess this is related:
>
> https://lore.kernel.org/linux-fsdevel/20210209100115.GB1208880@miu.piliscsaba.redhat.com/
>
> Can you verify that the patch at the above link fixes your issue?
>
Hi Miklos,

Copying the referenced patch here for easier discussion.

> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1117,17 +1117,12 @@ static ssize_t fuse_send_write_pages(str
>       count = ia->write.out.size;
>       for (i = 0; i < ap->num_pages; i++) {
>               struct page *page = ap->pages[i];
> +             bool page_locked = ap->page_locked && (i == ap->num_pages - 1);
Any reason for handling only the last locked page in the page array?
To be specific, it looks like the first page in the array can also be
partially dirty and locked?

>
> -             if (!err && !offset && count >= PAGE_SIZE)
> -                     SetPageUptodate(page);
> -
> -             if (count > PAGE_SIZE - offset)
> -                     count -= PAGE_SIZE - offset;
> -             else
> -                     count = 0;
> -             offset = 0;
> -
> -             unlock_page(page);
> +             if (err)
> +                     ClearPageUptodate(page);
> +             if (page_locked)
> +                     unlock_page(page);
>               put_page(page);
>       }
>
> @@ -1191,6 +1186,16 @@ static ssize_t fuse_fill_write_pages(str
>               if (offset == PAGE_SIZE)
>                       offset = 0;
>
> +             /* If we copied full page, mark it uptodate */
> +             if (tmp == PAGE_SIZE)
> +                     SetPageUptodate(page);
> +
> +             if (PageUptodate(page)) {
> +                     unlock_page(page);
> +             } else {
> +                     ap->page_locked = true;
> +                     break;
> +             }
Is it possible to still have two pages locked before we start to issue
WRITEs? The deadlock found by Baolin is a stable page handling issue
between fuse_fill_write_pages() and write_cache_pages(). It seems that
as long as we ever lock two or more pages at the same time during
fuse_fill_write_pages(), the race window is there, since
write_cache_pages() can set PG_DIRTY on one page and then block trying to
lock the others, whereas fuse_fill_write_pages() can lock all the related
pages and block waiting for writeback.

To illustrate the deadlock:

write_cache_pages()                        fuse_perform_write()
lock page A
set PG_DIRTY on page A
                                           lock page A and page B in
                                           fuse_fill_write_pages()
                                           wait for page A and page B writeback
                                           in fuse_send_write_pages()
lock page B

Then write_cache_pages() is blocked waiting to lock page B, while page B
is held locked by fuse_perform_write(), which is waiting for page A to be
written back.
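
Purely as an illustration of the circular wait above, here is a toy
user-space model with pthreads (not FUSE code; the "pages", thread roles
and names are made up for the example). When run, it hangs: the worker
owns page A's "writeback" state and waits for page B's lock, while the
writer holds both page locks and waits for page A's writeback to clear.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* A toy "page": a page lock, plus a writeback flag with its own lock/cond. */
struct toy_page {
	pthread_mutex_t lock;       /* models lock_page()/unlock_page() */
	pthread_mutex_t state_lock; /* protects 'writeback' */
	pthread_cond_t  wb_done;
	bool            writeback;  /* models PageWriteback() */
};

#define TOY_PAGE_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER, \
			PTHREAD_COND_INITIALIZER, false }

static struct toy_page page_a = TOY_PAGE_INIT;
static struct toy_page page_b = TOY_PAGE_INIT;

/* Plays the role of write_cache_pages()/fuse_writepages_fill(). */
static void *writeback_worker(void *unused)
{
	(void)unused;

	pthread_mutex_lock(&page_a.lock);
	pthread_mutex_lock(&page_a.state_lock);
	page_a.writeback = true;          /* page A is now "under writeback" */
	pthread_mutex_unlock(&page_a.state_lock);
	pthread_mutex_unlock(&page_a.lock);

	sleep(1);                         /* let the writer lock pages A and B */

	puts("worker: trying to lock page B ...");
	pthread_mutex_lock(&page_b.lock); /* blocks: the writer holds page B */

	/* Never reached: this is where page A's writeback would complete. */
	pthread_mutex_lock(&page_a.state_lock);
	page_a.writeback = false;
	pthread_cond_broadcast(&page_a.wb_done);
	pthread_mutex_unlock(&page_a.state_lock);
	pthread_mutex_unlock(&page_b.lock);
	return NULL;
}

/* Plays the role of fuse_perform_write(): lock all pages, then wait. */
static void *writer(void *unused)
{
	(void)unused;

	usleep(100 * 1000);               /* let the worker mark page A first */

	pthread_mutex_lock(&page_a.lock);
	pthread_mutex_lock(&page_b.lock);

	puts("writer: waiting for page A writeback ...");
	pthread_mutex_lock(&page_a.state_lock);
	while (page_a.writeback)          /* blocks: only the worker clears it */
		pthread_cond_wait(&page_a.wb_done, &page_a.state_lock);
	pthread_mutex_unlock(&page_a.state_lock);

	pthread_mutex_unlock(&page_b.lock);
	pthread_mutex_unlock(&page_a.lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writeback_worker, NULL);
	pthread_create(&b, NULL, writer, NULL);
	pthread_join(a, NULL);            /* never returns: circular wait */
	pthread_join(b, NULL);
	puts("no deadlock");              /* never printed */
	return 0;
}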

Cheers,
Tao


>               if (!fc->big_writes)
>                       break;
Miklos Szeredi April 14, 2021, 1:20 p.m. UTC | #9
On Wed, Apr 14, 2021 at 2:22 PM Peng Tao <bergwolf@gmail.com> wrote:
>

> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1117,17 +1117,12 @@ static ssize_t fuse_send_write_pages(str
> >       count = ia->write.out.size;
> >       for (i = 0; i < ap->num_pages; i++) {
> >               struct page *page = ap->pages[i];
> > +             bool page_locked = ap->page_locked && (i == ap->num_pages - 1);
> Any reason for just handling the last locked page in the page array?
> To be specific, it look like the first page in the array can also be
> partial dirty and locked?

In that case the first partial page will be locked, and it'll break
out of the loop...

> >
> > -             if (!err && !offset && count >= PAGE_SIZE)
> > -                     SetPageUptodate(page);
> > -
> > -             if (count > PAGE_SIZE - offset)
> > -                     count -= PAGE_SIZE - offset;
> > -             else
> > -                     count = 0;
> > -             offset = 0;
> > -
> > -             unlock_page(page);
> > +             if (err)
> > +                     ClearPageUptodate(page);
> > +             if (page_locked)
> > +                     unlock_page(page);
> >               put_page(page);
> >       }
> >
> > @@ -1191,6 +1186,16 @@ static ssize_t fuse_fill_write_pages(str
> >               if (offset == PAGE_SIZE)
> >                       offset = 0;
> >
> > +             /* If we copied full page, mark it uptodate */
> > +             if (tmp == PAGE_SIZE)
> > +                     SetPageUptodate(page);
> > +
> > +             if (PageUptodate(page)) {
> > +                     unlock_page(page);
> > +             } else {
> > +                     ap->page_locked = true;
> > +                     break;

... here, and send it as a separate WRITE request.

So the multi-page case with a partial & non-uptodate head page will
always result in the write request being split into two (even if
there's no partial tail page).

Thanks,
Miklos
Peng Tao April 15, 2021, 12:30 p.m. UTC | #10
On Wed, Apr 14, 2021 at 9:20 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Apr 14, 2021 at 2:22 PM Peng Tao <bergwolf@gmail.com> wrote:
> >
>
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1117,17 +1117,12 @@ static ssize_t fuse_send_write_pages(str
> > >       count = ia->write.out.size;
> > >       for (i = 0; i < ap->num_pages; i++) {
> > >               struct page *page = ap->pages[i];
> > > +             bool page_locked = ap->page_locked && (i == ap->num_pages - 1);
> > Any reason for just handling the last locked page in the page array?
> > To be specific, it look like the first page in the array can also be
> > partial dirty and locked?
>
> In that case the first partial page will be locked, and it'll break
> out of the loop...
>
> > >
> > > -             if (!err && !offset && count >= PAGE_SIZE)
> > > -                     SetPageUptodate(page);
> > > -
> > > -             if (count > PAGE_SIZE - offset)
> > > -                     count -= PAGE_SIZE - offset;
> > > -             else
> > > -                     count = 0;
> > > -             offset = 0;
> > > -
> > > -             unlock_page(page);
> > > +             if (err)
> > > +                     ClearPageUptodate(page);
> > > +             if (page_locked)
> > > +                     unlock_page(page);
> > >               put_page(page);
> > >       }
> > >
> > > @@ -1191,6 +1186,16 @@ static ssize_t fuse_fill_write_pages(str
> > >               if (offset == PAGE_SIZE)
> > >                       offset = 0;
> > >
> > > +             /* If we copied full page, mark it uptodate */
> > > +             if (tmp == PAGE_SIZE)
> > > +                     SetPageUptodate(page);
> > > +
> > > +             if (PageUptodate(page)) {
> > > +                     unlock_page(page);
> > > +             } else {
> > > +                     ap->page_locked = true;
> > > +                     break;
>
> ... here, and send it as a separate WRITE request.
>
> So the multi-page case with a partial & non-uptodate head page will
> always result in the write request being split into two (even if
> there's no partial tail page).

Ah, good point! Thanks for the explanation. I agree that it can fix
the deadlock issue here.

One thing I'm still uncertain about is that fuse used to fill the
page, wait for page writeback, and send it to userspace all with the
page locked, which is kind of like a stable page mechanism for FUSE.
With the above change, we no longer lock a PG_uptodate page when
waiting for its writeback and sending it to userspace. Then the page
can be modified while it is being sent to userspace. Is that acceptable?

Cheers,
Tao

Patch

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb..9a30093 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1101,9 +1101,6 @@  static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	unsigned int offset, i;
 	int err;
 
-	for (i = 0; i < ap->num_pages; i++)
-		fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
-
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
 	if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
@@ -1140,6 +1137,7 @@  static ssize_t fuse_fill_write_pages(struct fuse_args_pages *ap,
 				     unsigned int max_pages)
 {
 	struct fuse_conn *fc = get_fuse_conn(mapping->host);
+	struct inode *inode = mapping->host;
 	unsigned offset = pos & (PAGE_SIZE - 1);
 	size_t count = 0;
 	int err;
@@ -1166,6 +1164,8 @@  static ssize_t fuse_fill_write_pages(struct fuse_args_pages *ap,
 		if (!page)
 			break;
 
+		fuse_wait_on_page_writeback(inode, page->index);
+
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);