diff mbox series

fs-writeback: Flush plug before next iteration in wb_writeback()

Message ID 20220415013735.1610091-1-chengzhihao1@huawei.com (mailing list archive)
State New, archived
Headers show
Series fs-writeback: Flush plug before next iteration in wb_writeback() | expand

Commit Message

Zhihao Cheng April 15, 2022, 1:37 a.m. UTC
Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:

    wb_writeback		fat_file_fsync
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs have been added into plug->mq_list  // LOCK A
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes // fat's bdev
    .   __writeback_single_inode
    .   . generic_writepages
    .   .   __block_write_full_page
    .   .   . . 	    __generic_file_fsync
    .   .   . . 	      sync_inode_metadata
    .   .   . . 	        writeback_single_inode
    .   .   . . 		  __writeback_single_inode
    .   .   . . 		    fat_write_inode
    .   .   . . 		      __fat_write_inode
    .   .   . . 		        sync_dirty_buffer	// fat's bdev
    .   .   . . 			  lock_buffer(bh)	// LOCK B
    .   .   . . 			    submit_bh
    .   .   . . 			      blk_mq_get_tag	// LOCK A
    .   .   . trylock_buffer(bh)  // LOCK B
    .   .   .   redirty_page_for_writepage
    .   .   .     wbc->pages_skipped++
    .   .   --wbc->nr_to_write
    .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
    .   requeue_inode
    .     redirty_tail_locked
    if (progress)    // progress > 0
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called

Above process triggers a hungtask like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:
2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30

In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().

Fix it by flush plug before next iteration in wb_writeback().

Goto Link to find a reproducer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/fs-writeback.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig April 15, 2022, 6:39 a.m. UTC | #1
On Fri, Apr 15, 2022 at 09:37:35AM +0800, Zhihao Cheng wrote:
> +		if (progress) {
> +			/*
> +			 * The progress may be false postive in page redirty
> +			 * case (which is caused by failing to get buffer head
> +			 * lock), which will requeue dirty inodes and start
> +			 * next writeback iteration, and other tasks maybe
> +			 * stuck for getting tags for new requests. So, flush
> +			 * plug to schedule requests holding tags.
> +			 *
> +			 * The code can be removed after buffer head
> +			 * disappering from linux.
> +			 */
> +			blk_flush_plug(current->plug, false);

This basically removes plugging entirely, so we might as well stop
adding the plug if we can't solve it any other way.  But it seems
like that fake progress needs to be fixed instead.
Zhihao Cheng April 15, 2022, 8:39 a.m. UTC | #2
在 2022/4/15 14:39, Christoph Hellwig 写道:
Hi, Christoph
> This basically removes plugging entirely, so we might as well stop
> adding the plug if we can't solve it any other way.  But it seems
> like that fake progress needs to be fixed instead.
> 
Maybe there is a more ideal solution:
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e524c0a1749c..9723f77841f8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1855,7 +1855,7 @@ static long writeback_sb_inodes(struct super_block 
*sb,

                 wbc_detach_inode(&wbc);
                 work->nr_pages -= write_chunk - wbc.nr_to_write;
-               wrote += write_chunk - wbc.nr_to_write;
+               wrote += write_chunk - wbc.nr_to_write - wbc.pages_skipped;

                 if (need_resched()) {
                         /*

, or following is better(It looks awkward.):

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e524c0a1749c..5f310e53bf1e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1780,6 +1780,7 @@ static long writeback_sb_inodes(struct super_block 
*sb,
         while (!list_empty(&wb->b_io)) {
                 struct inode *inode = wb_inode(wb->b_io.prev);
                 struct bdi_writeback *tmp_wb;
+               long tmp_wrote;

                 if (inode->i_sb != sb) {
                         if (work->sb) {
@@ -1854,8 +1855,11 @@ static long writeback_sb_inodes(struct 
super_block *sb,
                 __writeback_single_inode(inode, &wbc);

                 wbc_detach_inode(&wbc);
-               work->nr_pages -= write_chunk - wbc.nr_to_write;
-               wrote += write_chunk - wbc.nr_to_write;
+               tmp_wrote = write_chunk - wbc.nr_to_write >= 
wbc.pages_skipped ?
+                           write_chunk - wbc.nr_to_write - 
wbc.pages_skipped :
+                           write_chunk - wbc.nr_to_write;workaround
+               work->nr_pages -= tmp_wrote;
+               wrote += tmp_wrote;

                 if (need_resched()) {
                         /*

It depends on how specific filesystem behaves after invoking 
redirty_page_for_writepage(). Most filesystems return 
AOP_WRITEPAGE_ACTIVATE or 0 after invoking redirty_page_for_writepage() 
in writepage, but there still exist accidential examples:
1. metapage_writepage() could returns -EIO after 
redirty_page_for_writepage()
2. ext4_writepage() could returns -ENOMEM after redirty_page_for_writepage()

write_cache_pages
   error = (*writepage)(page, wbc, data);
   if (unlikely(error)) {
     ...
     break;
   }
   --wbc->nr_to_write   // Skip if 'error < 0'. And if writepage invokes 
redirty_page_for_writepage(), wrote could be negative.


I think the root cause is fsync gets buffer head's lock without locking 
corresponding page, fixing 'progess' and flushing plug are both workarounds.
Christoph Hellwig April 16, 2022, 5:42 a.m. UTC | #3
> I think the root cause is fsync gets buffer head's lock without locking 
> corresponding page, fixing 'progess' and flushing plug are both 
> workarounds.

So let's fix that.
Zhihao Cheng April 16, 2022, 8:41 a.m. UTC | #4
在 2022/4/16 13:42, Christoph Hellwig 写道:
>> I think the root cause is fsync gets buffer head's lock without locking
>> corresponding page, fixing 'progess' and flushing plug are both
>> workarounds.
> 
> So let's fix that.
> 

I think adding page lock before locking buffer head is a little 
difficult and risky:
1. There are too many places getting buffer head before submitting bio, 
and not all filesystems behave same in readpage/writepage/write_inode. 
For example, ntfs_read_block() has locked page before locking buffer 
head and then submitting bh, ext4(no journal) and fat may lock buffer 
head without locking page while writing inode. It's a huge work to check 
all places.
2. Import page lock before locking buffer head may bring new unknown 
problem(other deadlocks about page ?). Taking page lock before locking 
buffer head(in all processes which can be concurrent with wb_writeback) 
is a dangerous thing.

So, how about applying the safe and simple method(flush plug) for the 
time being?
PS: Maybe someday buffer head is removed from all filesystems, then we 
can remove this superfluous blk_flush_plug.
Christoph Hellwig April 18, 2022, 5 a.m. UTC | #5
On Sat, Apr 16, 2022 at 04:41:35PM +0800, Zhihao Cheng wrote:
> So, how about applying the safe and simple method(flush plug) for the time 
> being?

As said before - if you flush everytime here the plug is basically
useless and we could remove it.  But it was added for a reason.  So let's
at least improve the accounting for the skipped writeback as suggested in
your last mail.
diff mbox series

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 591fe9cf1659..e524c0a1749c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2036,8 +2036,21 @@  static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			/*
+			 * The progress may be false postive in page redirty
+			 * case (which is caused by failing to get buffer head
+			 * lock), which will requeue dirty inodes and start
+			 * next writeback iteration, and other tasks maybe
+			 * stuck for getting tags for new requests. So, flush
+			 * plug to schedule requests holding tags.
+			 *
+			 * The code can be removed after buffer head
+			 * disappering from linux.
+			 */
+			blk_flush_plug(current->plug, false);
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */