[v3] Btrfs: fix hang on error (such as ENOSPC) when writing extent pages
diff mbox

Message ID 1399652260-9360-1-git-send-email-fdmanana@gmail.com
State Accepted
Headers show

Commit Message

Filipe Manana May 9, 2014, 4:17 p.m. UTC
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:

[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
[ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)

It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.

This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:

    mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt

    fsstress -p 6 -d /mnt -n 100000 -x \
        "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
	    -f allocsp=0 \
	    -f bulkstat=0 \
	    -f bulkstat1=0 \
	    -f chown=0 \
	    -f creat=1 \
	    -f dread=0 \
	    -f dwrite=0 \
	    -f fallocate=1 \
	    -f fdatasync=0 \
	    -f fiemap=0 \
	    -f freesp=0 \
	    -f fsync=0 \
	    -f getattr=0 \
	    -f getdents=0 \
	    -f link=0 \
	    -f mkdir=0 \
	    -f mknod=0 \
	    -f punch=1 \
	    -f read=0 \
	    -f readlink=0 \
	    -f rename=0 \
	    -f resvsp=0 \
	    -f rmdir=0 \
	    -f setxattr=0 \
	    -f stat=0 \
	    -f symlink=0 \
	    -f sync=0 \
	    -f truncate=1 \
	    -f unlink=0 \
	    -f unresvsp=0 \
	    -f write=4

So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
---

V2: Grab error from `em` pointer if available, do the error check and
    end_extent_writepage call before unlocking the page (just like
    end_bio_extent_writepage does).

V3: Make extent_write_cache_pages not stop processing remaining pages
    in the page vector if __extent_writepage returns an error for some
    page, so that ordered extents end up being processed and released.

 fs/btrfs/extent_io.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Marc MERLIN June 17, 2014, 7:21 p.m. UTC | #1
On Fri, May 09, 2014 at 05:17:40PM +0100, Filipe David Borba Manana wrote:
> When running low on available disk space and having several processes
> doing buffered file IO, I got the following trace in dmesg:

I was building 3.15.1 and noticed this patch didn't make it in (your other
2 btrfs send patches I was carrying did make it). Is that on purpose, and
maybe being delayed for 3.16, or is that maybe an oversight?

Thanks,
Marc

> [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
> [ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
> [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
> [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
> [ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
> [ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
> [ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
> [ 4202.720922] Call Trace:
> [ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
> [ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
> [ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
> [ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
> [ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
> [ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
> (...)
> [ 4202.721027] 2 locks held by kworker/u8:1/5450:
> [ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
> [ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
> [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
> [ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
> [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
> [ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
> [ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
> [ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
> [ 4202.721718] Call Trace:
> [ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
> [ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
> [ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
> [ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
> [ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
> [ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
> [ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
> [ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
> [ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
> [ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
> [ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
> (...)
> 
> It turns out that extent_io.c:__extent_writepage(), which ends up being called
> through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
> -ENOSPC when calling the fill_delalloc callback. In this situation, it returned
> without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
> ever being called for the respective page, which prevents the ordered extent's
> bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
> is never queued into the endio_write_workers queue. This makes the task that
> called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
> extent.
> 
> This is fairly easy to reproduce using a small filesystem and fsstress on
> a quad core vm:
> 
>     mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
>     mount /dev/sdd /mnt
> 
>     fsstress -p 6 -d /mnt -n 100000 -x \
>         "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
> 	    -f allocsp=0 \
> 	    -f bulkstat=0 \
> 	    -f bulkstat1=0 \
> 	    -f chown=0 \
> 	    -f creat=1 \
> 	    -f dread=0 \
> 	    -f dwrite=0 \
> 	    -f fallocate=1 \
> 	    -f fdatasync=0 \
> 	    -f fiemap=0 \
> 	    -f freesp=0 \
> 	    -f fsync=0 \
> 	    -f getattr=0 \
> 	    -f getdents=0 \
> 	    -f link=0 \
> 	    -f mkdir=0 \
> 	    -f mknod=0 \
> 	    -f punch=1 \
> 	    -f read=0 \
> 	    -f readlink=0 \
> 	    -f rename=0 \
> 	    -f resvsp=0 \
> 	    -f rmdir=0 \
> 	    -f setxattr=0 \
> 	    -f stat=0 \
> 	    -f symlink=0 \
> 	    -f sync=0 \
> 	    -f truncate=1 \
> 	    -f unlink=0 \
> 	    -f unresvsp=0 \
> 	    -f write=4
> 
> So just ensure that if an error happens while writing the extent page
> we call the writepage_end_io_hook callback. Also make it return the
> error code and ensure the caller (extent_write_cache_pages) processes
> all pages in the page vector even if an error happens only for some
> of them, so that ordered extents end up released.
> 
> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
> ---
> 
> V2: Grab error from `em` pointer if available, do the error check and
>     end_extent_writepage call before unlocking the page (just like
>     end_bio_extent_writepage does).
> 
> V3: Make extent_write_cache_pages not stop processing remaining pages
>     in the page vector if __extent_writepage returns an error for some
>     page, so that ordered extents end up being processed and released.
> 
>  fs/btrfs/extent_io.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0c43896..4468afd 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3284,6 +3284,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>  				     end - cur + 1, 1);
>  		if (IS_ERR_OR_NULL(em)) {
>  			SetPageError(page);
> +			ret = PTR_ERR_OR_ZERO(em);
>  			break;
>  		}
>  
> @@ -3370,13 +3371,17 @@ done:
>  		set_page_writeback(page);
>  		end_page_writeback(page);
>  	}
> +	if (PageError(page)) {
> +		ret = ret < 0 ? ret : -EIO;
> +		end_extent_writepage(page, ret, start, page_end);
> +	}
>  	unlock_page(page);
>  
>  done_unlocked:
>  
>  	/* drop our reference on any cached states */
>  	free_extent_state(cached_state);
> -	return 0;
> +	return ret;
>  }
>  
>  static int eb_wait(void *word)
> @@ -3699,6 +3704,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
>  	struct inode *inode = mapping->host;
>  	int ret = 0;
>  	int done = 0;
> +	int err = 0;
>  	int nr_to_write_done = 0;
>  	struct pagevec pvec;
>  	int nr_pages;
> @@ -3785,8 +3791,8 @@ retry:
>  				unlock_page(page);
>  				ret = 0;
>  			}
> -			if (ret)
> -				done = 1;
> +			if (!err && ret < 0)
> +				err = ret;
>  
>  			/*
>  			 * the filesystem may choose to bump up nr_to_write.
> @@ -3798,7 +3804,7 @@ retry:
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> -	if (!scanned && !done) {
> +	if (!scanned && !done && !err) {
>  		/*
>  		 * We hit the last page and there is more work to be done: wrap
>  		 * back to the start of the file
> @@ -3808,7 +3814,7 @@ retry:
>  		goto retry;
>  	}
>  	btrfs_add_delayed_iput(inode);
> -	return ret;
> +	return err;
>  }
>  
>  static void flush_epd_write_bio(struct extent_page_data *epd)
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c43896..4468afd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3284,6 +3284,7 @@  static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 				     end - cur + 1, 1);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
+			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
 
@@ -3370,13 +3371,17 @@  done:
 		set_page_writeback(page);
 		end_page_writeback(page);
 	}
+	if (PageError(page)) {
+		ret = ret < 0 ? ret : -EIO;
+		end_extent_writepage(page, ret, start, page_end);
+	}
 	unlock_page(page);
 
 done_unlocked:
 
 	/* drop our reference on any cached states */
 	free_extent_state(cached_state);
-	return 0;
+	return ret;
 }
 
 static int eb_wait(void *word)
@@ -3699,6 +3704,7 @@  static int extent_write_cache_pages(struct extent_io_tree *tree,
 	struct inode *inode = mapping->host;
 	int ret = 0;
 	int done = 0;
+	int err = 0;
 	int nr_to_write_done = 0;
 	struct pagevec pvec;
 	int nr_pages;
@@ -3785,8 +3791,8 @@  retry:
 				unlock_page(page);
 				ret = 0;
 			}
-			if (ret)
-				done = 1;
+			if (!err && ret < 0)
+				err = ret;
 
 			/*
 			 * the filesystem may choose to bump up nr_to_write.
@@ -3798,7 +3804,7 @@  retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!scanned && !done) {
+	if (!scanned && !done && !err) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
 		 * back to the start of the file
@@ -3808,7 +3814,7 @@  retry:
 		goto retry;
 	}
 	btrfs_add_delayed_iput(inode);
-	return ret;
+	return err;
 }
 
 static void flush_epd_write_bio(struct extent_page_data *epd)