diff mbox

Btrfs: fix crash of compressed writes

Message ID 1380544797-29798-1-git-send-email-bo.li.liu@oracle.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Liu Bo Sept. 30, 2013, 12:39 p.m. UTC
The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.

The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca1cf7a974231b759197a1aebcf39c2a,
Btrfs: actually limit the size of delalloc range).

Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
    ordered extent and so on.
(4) submit the page A.

It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).

The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).

This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.

This fixes it by not enforcing the 'max_bytes' limit in the special fallback
case.

[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G           O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>]  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731]  ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731]  0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731]  ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731]  [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731]  [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731]  [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731]  [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731]  [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731]  [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731]  [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731]  [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731]  RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/extent_io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Josef Bacik Sept. 30, 2013, 5:02 p.m. UTC | #1
On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
> The crash[1] is found by xfstests/generic/208 with "-o compress",
> it's not reproduced everytime, but it does panic.
> 
> The bug is quite interesting, it's actually introduced by a recent commit
> (573aecafca1cf7a974231b759197a1aebcf39c2a,
> Btrfs: actually limit the size of delalloc range).
> 
> Btrfs implements delay allocation, so during writeback, we
> (1) get a page A and lock it
> (2) search the state tree for delalloc bytes and lock all pages within the range
> (3) process the delalloc range, including find disk space and create
>     ordered extent and so on.
> (4) submit the page A.
> 
> It runs well in normal cases, but if we're in a racy case, eg.
> buffered compressed writes and aio-dio writes,
> sometimes we may fail to lock all pages in the 'delalloc' range,
> in which case, we need to fall back to search the state tree again with
> a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
> 
> The mentioned commit has a side effect, that is, in the fallback case,
> we can find delalloc bytes before the index of the page we already have locked,
> so we're in the case of (delalloc_end <= *start) and return with (found > 0).
> 
> This ends with not locking delalloc pages but making ->writepage still
> process them, and the crash happens.
> 
> This fixes it by not enforcing the 'max_bytes' limit in the special fallback
> case.
> 

Great analysis, thank you for that, however I don't like the fix ;).  Instead We
need to change the

if (!found || delalloc_end <= *start)

to always return 0 since we should not call fill delalloc if the delalloc range
doesn't start at our current offset.  Secondly the

max_bytes = PAGE_CACHE_SIZE - offset;

is completely crap since we are talking about the entire page/sector.  We should
simply set max_bytes to sectorszie or PAGE_CACHE_SIZE that way we avoid this
issue.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Oct. 1, 2013, 3:37 a.m. UTC | #2
On Mon, Sep 30, 2013 at 01:02:49PM -0400, Josef Bacik wrote:
> On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
> > The crash[1] is found by xfstests/generic/208 with "-o compress",
> > it's not reproduced everytime, but it does panic.
> > 
> > The bug is quite interesting, it's actually introduced by a recent commit
> > (573aecafca1cf7a974231b759197a1aebcf39c2a,
> > Btrfs: actually limit the size of delalloc range).
> > 
> > Btrfs implements delay allocation, so during writeback, we
> > (1) get a page A and lock it
> > (2) search the state tree for delalloc bytes and lock all pages within the range
> > (3) process the delalloc range, including find disk space and create
> >     ordered extent and so on.
> > (4) submit the page A.
> > 
> > It runs well in normal cases, but if we're in a racy case, eg.
> > buffered compressed writes and aio-dio writes,
> > sometimes we may fail to lock all pages in the 'delalloc' range,
> > in which case, we need to fall back to search the state tree again with
> > a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
> > 
> > The mentioned commit has a side effect, that is, in the fallback case,
> > we can find delalloc bytes before the index of the page we already have locked,
> > so we're in the case of (delalloc_end <= *start) and return with (found > 0).
> > 
> > This ends with not locking delalloc pages but making ->writepage still
> > process them, and the crash happens.
> > 
> > This fixes it by not enforcing the 'max_bytes' limit in the special fallback
> > case.
> > 
> 
> Great analysis, thank you for that, however I don't like the fix ;).  Instead We
> need to change the
> 
> if (!found || delalloc_end <= *start)
> 
> to always return 0 since we should not call fill delalloc if the delalloc range
> doesn't start at our current offset.  Secondly the

Setting 'found = 0' is also the first idea into my mind :), but I'm not sure if
it's right.

I've check it out, commit 70b99e6959a4c28ae1b314985eca731f3db72f1d
(Btrfs: Compression corner fixes,
Signed-off-by: Chris Mason <chris.mason@oracle.com>)
introduced the (delalloc_end <= *start).

The commit log says, 
"Make sure we lock pages in the correct order during delalloc.  The
 search into the state tree for delalloc bytes can return bytes
 before the page we already have locked."

But I think if we find bytes before the page we locked, then we
should get (found = 0).

So I don't know why we need the check (delalloc_end <= *start),
maybe some corner cases need that? Chris can explain a bit?

> 
> max_bytes = PAGE_CACHE_SIZE - offset;
> 
> is completely crap since we are talking about the entire page/sector.  We should
> simply set max_bytes to sectorszie or PAGE_CACHE_SIZE that way we avoid this
> issue.  Thanks,

This might be due to historical reasons, but for now, it doesn't cause
problems as the passed @start is page aligned and @offset is always 0.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Oct. 1, 2013, 12:38 p.m. UTC | #3
On Tue, Oct 01, 2013 at 11:37:22AM +0800, Liu Bo wrote:
> On Mon, Sep 30, 2013 at 01:02:49PM -0400, Josef Bacik wrote:
> > On Mon, Sep 30, 2013 at 08:39:57PM +0800, Liu Bo wrote:
> > > The crash[1] is found by xfstests/generic/208 with "-o compress",
> > > it's not reproduced everytime, but it does panic.
> > > 
> > > The bug is quite interesting, it's actually introduced by a recent commit
> > > (573aecafca1cf7a974231b759197a1aebcf39c2a,
> > > Btrfs: actually limit the size of delalloc range).
> > > 
> > > Btrfs implements delay allocation, so during writeback, we
> > > (1) get a page A and lock it
> > > (2) search the state tree for delalloc bytes and lock all pages within the range
> > > (3) process the delalloc range, including find disk space and create
> > >     ordered extent and so on.
> > > (4) submit the page A.
> > > 
> > > It runs well in normal cases, but if we're in a racy case, eg.
> > > buffered compressed writes and aio-dio writes,
> > > sometimes we may fail to lock all pages in the 'delalloc' range,
> > > in which case, we need to fall back to search the state tree again with
> > > a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
> > > 
> > > The mentioned commit has a side effect, that is, in the fallback case,
> > > we can find delalloc bytes before the index of the page we already have locked,
> > > so we're in the case of (delalloc_end <= *start) and return with (found > 0).
> > > 
> > > This ends with not locking delalloc pages but making ->writepage still
> > > process them, and the crash happens.
> > > 
> > > This fixes it by not enforcing the 'max_bytes' limit in the special fallback
> > > case.
> > > 
> > 
> > Great analysis, thank you for that, however I don't like the fix ;).  Instead We
> > need to change the
> > 
> > if (!found || delalloc_end <= *start)
> > 
> > to always return 0 since we should not call fill delalloc if the delalloc range
> > doesn't start at our current offset.  Secondly the
> 
> Setting 'found = 0' is also the first idea into my mind :), but I'm not sure if
> it's right.
> 
> I've check it out, commit 70b99e6959a4c28ae1b314985eca731f3db72f1d
> (Btrfs: Compression corner fixes,
> Signed-off-by: Chris Mason <chris.mason@oracle.com>)
> introduced the (delalloc_end <= *start).
> 
> The commit log says, 
> "Make sure we lock pages in the correct order during delalloc.  The
>  search into the state tree for delalloc bytes can return bytes
>  before the page we already have locked."
> 
> But I think if we find bytes before the page we locked, then we
> should get (found = 0).
> 
> So I don't know why we need the check (delalloc_end <= *start),
> maybe some corner cases need that? Chris can explain a bit?
> 

No we needed his other addition below which is important, for when the delalloc
range starts before but still encompasses our page.  The other part is not
needed, we should be returning 0 if delalloc_end lands before or on our start,
which should fix your bug and make it a bit more sane.

> > 
> > max_bytes = PAGE_CACHE_SIZE - offset;
> > 
> > is completely crap since we are talking about the entire page/sector.  We should
> > simply set max_bytes to sectorszie or PAGE_CACHE_SIZE that way we avoid this
> > issue.  Thanks,
> 
> This might be due to historical reasons, but for now, it doesn't cause
> problems as the passed @start is page aligned and @offset is always 0.
> 

Yeah you are right, still I'd kill it anyway since it doesn't make sense.
Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Oct. 1, 2013, 3:42 p.m. UTC | #4
On Tue, Oct 01, 2013 at 08:38:30AM -0400, Josef Bacik wrote:
> No we needed his other addition below which is important, for when the delalloc
> range starts before but still encompasses our page.  The other part is not
> needed, we should be returning 0 if delalloc_end lands before or on our start,
> which should fix your bug and make it a bit more sane.

Got it, I'm sending a new version.

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c09a40d..76a977e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1483,7 +1483,12 @@  static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		node = rb_next(node);
 		total_bytes += state->end - state->start + 1;
 		if (total_bytes >= max_bytes) {
-			*end = *start + max_bytes - 1;
+			/*
+			 * tell if we fall back from the failure of
+			 * lock_delalloc_pages()
+			 */
+			if (max_bytes > PAGE_CACHE_SIZE)
+				*end = *start + max_bytes - 1;
 			break;
 		}
 		if (!node)