[v7,1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
diff mbox

Message ID 20170308022552.14686-1-quwenruo@cn.fujitsu.com
State New
Headers show

Commit Message

Qu Wenruo March 8, 2017, 2:25 a.m. UTC
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v6:
  New, split from v5 patch, as this is a separate bug.
v7:
  Fix a wrong operator pointed out by Liu Bo.
  Test case will follow soon.
---
 fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

Comments

Filipe Manana March 8, 2017, 10:18 a.m. UTC | #1
On Wed, Mar 8, 2017 at 2:25 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
>
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
>
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
>
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
>
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
>
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
>
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks

> ---
> v6:
>   New, split from v5 patch, as this is a separate bug.
> v7:
>   Fix a wrong operator pointed out by Liu Bo.
>   Test case will follow soon.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c40060cc481f..fe588bf30d02 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>                     BTRFS_DATA_RELOC_TREE_OBJECTID) {
>                         ret = btrfs_reloc_clone_csums(inode, start,
>                                                       cur_alloc_size);
> +                       /*
> +                        * Only drop cache here, and process as normal.
> +                        *
> +                        * We must not allow extent_clear_unlock_delalloc()
> +                        * at out_unlock label to free meta of this ordered
> +                        * extent, as its meta should be freed by
> +                        * btrfs_finish_ordered_io().
> +                        *
> +                        * So we must continue until @start is increased to
> +                        * skip current ordered extent.
> +                        */
>                         if (ret)
> -                               goto out_drop_extent_cache;
> +                               btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +                                               start + ram_size - 1, 0);
>                 }
>
>                 btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>
> -               if (disk_num_bytes < cur_alloc_size)
> -                       break;
> -
>                 /* we're not doing compressed IO, don't unlock the first
>                  * page (which the caller expects to stay locked), don't
>                  * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>                                              delalloc_end, locked_page,
>                                              EXTENT_LOCKED | EXTENT_DELALLOC,
>                                              op);
> -               disk_num_bytes -= cur_alloc_size;
> +               if (disk_num_bytes < cur_alloc_size)
> +                       disk_num_bytes = 0;
> +               else
> +                       disk_num_bytes -= cur_alloc_size;
>                 num_bytes -= cur_alloc_size;
>                 alloc_hint = ins.objectid + ins.offset;
>                 start += cur_alloc_size;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, since start is increased
> +                * extent_clear_unlock_delalloc() at out_unlock label won't
> +                * free metadata of current ordered extent, we're OK to exit.
> +                */
> +               if (ret)
> +                       goto out_unlock;
>         }
>  out:
>         return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 BUG_ON(ret); /* -ENOMEM */
>
>                 if (root->root_key.objectid ==
> -                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +                   BTRFS_DATA_RELOC_TREE_OBJECTID)
> +                       /*
> +                        * Error handled later, as we must prevent
> +                        * extent_clear_unlock_delalloc() in error handler
> +                        * from freeing metadata of created ordered extent.
> +                        */
>                         ret = btrfs_reloc_clone_csums(inode, cur_offset,
>                                                       num_bytes);
> -                       if (ret) {
> -                               if (!nolock && nocow)
> -                                       btrfs_end_write_no_snapshoting(root);
> -                               goto error;
> -                       }
> -               }
>
>                 extent_clear_unlock_delalloc(inode, cur_offset,
>                                              cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>                 if (!nolock && nocow)
>                         btrfs_end_write_no_snapshoting(root);
>                 cur_offset = extent_end;
> +
> +               /*
> +                * btrfs_reloc_clone_csums() error, now we're OK to call error
> +                * handler, as metadata for created ordered extent will only
> +                * be freed by btrfs_finish_ordered_io().
> +                */
> +               if (ret)
> +                       goto error;
>                 if (cur_offset > end)
>                         break;
>         }
> --
> 2.12.0
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo March 9, 2017, 5:35 p.m. UTC | #2
On Wed, Mar 08, 2017 at 10:25:51AM +0800, Qu Wenruo wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
> 
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
> 
> Currently, due to another bug blocking ordered extents, the bug is only
> reproducible under certain block group layout and using error injection.
> 
> a) Create one data block group with one 4K extent in it.
>    To avoid the bug that hangs btrfs due to ordered extent which never
>    finishes
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
> 
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
> 
> (The ascii chart shows a more generic case of this bug other than the
> bug mentioned above)
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
>                     |<----------- cleanup range --------------->|
> |<-----------  ----------->|
>              \/
>  btrfs_finish_ordered_io() range
> 
> So error handler, which calls extent_clear_unlock_delalloc() with
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
> will both cover OE n, and free its metadata, causing metadata under flow.
> 
> [Fix]
> The fix is to ensure after calling btrfs_add_ordered_extent(), we only
> call error handler after increasing the iteration offset, so that
> cleanup range won't cover any created ordered extent.
> 
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<-----------  ----------->|<---------- cleanup range --------->|
>              \/
>  btrfs_finish_ordered_io() range

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> v6:
>   New, split from v5 patch, as this is a separate bug.
> v7:
>   Fix a wrong operator pointed out by Liu Bo.
>   Test case will follow soon.
> ---
>  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c40060cc481f..fe588bf30d02 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
> +			/*
> +			 * Only drop cache here, and process as normal.
> +			 *
> +			 * We must not allow extent_clear_unlock_delalloc()
> +			 * at out_unlock label to free meta of this ordered
> +			 * extent, as its meta should be freed by
> +			 * btrfs_finish_ordered_io().
> +			 *
> +			 * So we must continue until @start is increased to
> +			 * skip current ordered extent.
> +			 */
>  			if (ret)
> -				goto out_drop_extent_cache;
> +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +						start + ram_size - 1, 0);
>  		}
>  
>  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  
> -		if (disk_num_bytes < cur_alloc_size)
> -			break;
> -
>  		/* we're not doing compressed IO, don't unlock the first
>  		 * page (which the caller expects to stay locked), don't
>  		 * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>  					     delalloc_end, locked_page,
>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>  					     op);
> -		disk_num_bytes -= cur_alloc_size;
> +		if (disk_num_bytes < cur_alloc_size)
> +			disk_num_bytes = 0;
> +		else
> +			disk_num_bytes -= cur_alloc_size;
>  		num_bytes -= cur_alloc_size;
>  		alloc_hint = ins.objectid + ins.offset;
>  		start += cur_alloc_size;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, since start is increased
> +		 * extent_clear_unlock_delalloc() at out_unlock label won't
> +		 * free metadata of current ordered extent, we're OK to exit.
> +		 */
> +		if (ret)
> +			goto out_unlock;
>  	}
>  out:
>  	return ret;
> @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		BUG_ON(ret); /* -ENOMEM */
>  
>  		if (root->root_key.objectid ==
> -		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +		    BTRFS_DATA_RELOC_TREE_OBJECTID)
> +			/*
> +			 * Error handled later, as we must prevent
> +			 * extent_clear_unlock_delalloc() in error handler
> +			 * from freeing metadata of created ordered extent.
> +			 */
>  			ret = btrfs_reloc_clone_csums(inode, cur_offset,
>  						      num_bytes);
> -			if (ret) {
> -				if (!nolock && nocow)
> -					btrfs_end_write_no_snapshoting(root);
> -				goto error;
> -			}
> -		}
>  
>  		extent_clear_unlock_delalloc(inode, cur_offset,
>  					     cur_offset + num_bytes - 1, end,
> @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  		if (!nolock && nocow)
>  			btrfs_end_write_no_snapshoting(root);
>  		cur_offset = extent_end;
> +
> +		/*
> +		 * btrfs_reloc_clone_csums() error, now we're OK to call error
> +		 * handler, as metadata for created ordered extent will only
> +		 * be freed by btrfs_finish_ordered_io().
> +		 */
> +		if (ret)
> +			goto error;
>  		if (cur_offset > end)
>  			break;
>  	}
> -- 
> 2.12.0
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Priebe - Profihost AG March 12, 2017, 8:49 p.m. UTC | #3
Hi Qu,

while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4).

V7 results in OOPS to me:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
PGD 14e18d4067 PUD 14e1868067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
i2c_core usb_common ata_piix floppy
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.7.5-20140722_172050-sagunt 04/01/2014
task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
__endio_write_update_ordered+0x33/0x140 [btrfs]
RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
Call Trace:
[<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
[<ffffffffb438f2f7>] bio_endio+0x57/0x60
[<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
[<ffffffffb438f2f7>] bio_endio+0x57/0x60
[<ffffffffb439763b>] blk_update_request+0x8b/0x330
[<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
[<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
[<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
[<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
[<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
[<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
[<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
[<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
[<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
[<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
[<ffffffffb4007cad>] handle_irq+0x1d/0x30
[<ffffffffb400750b>] do_IRQ+0x4b/0xd0
[<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
Leftover inexact backtrace:
2017-03-12 20:33:08     <IRQ><EOI>
2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
[<ffffffffb400fa3e>] default_idle+0x1e/0xe0
[<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
[<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
[<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
[<ffffffffb46a358c>] rest_init+0x7c/0x80
[<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
[<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
[<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
[<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
RSP <ffff8814eae03cd8>
CR2: 00000000000001f0
---[ end trace 7529a0652fd7873e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo March 13, 2017, 1:16 a.m. UTC | #4
At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
> Hi Qu,
>
> while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4).

Thanks for the test.

>
> V7 results in OOPS to me:
> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0

This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite 
nice clue.

> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]

IP points to:
---
static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
{
         struct btrfs_root *root = inode->root; << Either here

         if (root == root->fs_info->tree_root && << Or here
             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)

---

Taking the above offset into consideration, it's only possible for later 
case.

So here, we have a btrfs_inode whose @root is NULL.

This can be fixed easily by checking @root inside 
btrfs_is_free_space_inode(), as the backtrace shows that it's only 
happening for DirectIO, and it won't happen for free space cache inode.

But I'm more curious how this happened for a more accurate fix, or we 
could have other NULL pointer access.

Did you have any reproducer for this?

Thanks,
Qu

> PGD 14e18d4067 PUD 14e1868067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
> i2c_core usb_common ata_piix floppy
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.7.5-20140722_172050-sagunt 04/01/2014
> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
> __endio_write_update_ordered+0x33/0x140 [btrfs]
> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
> Call Trace:
> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
> Leftover inexact backtrace:
> 2017-03-12 20:33:08     <IRQ><EOI>
> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
> [<ffffffffb46a358c>] rest_init+0x7c/0x80
> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
> RSP <ffff8814eae03cd8>
> CR2: 00000000000001f0
> ---[ end trace 7529a0652fd7873e ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
>
> Greets,
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Priebe - Profihost AG March 13, 2017, 7:26 a.m. UTC | #5
Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
> 
> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>> v4.4).
> 
> Thanks for the test.
> 
>> V7 results in OOPS to me:
>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
> 
> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
> nice clue.
> 
>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
> 
> IP points to:
> ---
> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
> {
>         struct btrfs_root *root = inode->root; << Either here
> 
>         if (root == root->fs_info->tree_root && << Or here
>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
> 
> ---
> 
> Taking the above offset into consideration, it's only possible for later
> case.
> 
> So here, we have a btrfs_inode whose @root is NULL.

But wasn't this part of the code identical in V5? Why does it only
happen with V7?

> This can be fixed easily by checking @root inside
> btrfs_is_free_space_inode(), as the backtrace shows that it's only
> happening for DirectIO, and it won't happen for free space cache inode.
> 
> But I'm more curious how this happened for a more accurate fix, or we
> could have other NULL pointer access.
> 
> Did you have any reproducer for this?

Sorry no - this is a production MariaDB Server running btrfs with
compress-force=zlib. But if i could test anything i'll do.

Greets,
Stefan

> 
> Thanks,
> Qu
> 
>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>> Oops: 0000 [#1] SMP
>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>> i2c_core usb_common ata_piix floppy
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.7.5-20140722_172050-sagunt 04/01/2014
>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>> knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>> Call Trace:
>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>> Leftover inexact backtrace:
>> 2017-03-12 20:33:08     <IRQ><EOI>
>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP <ffff8814eae03cd8>
>> CR2: 00000000000001f0
>> ---[ end trace 7529a0652fd7873e ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>> 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Greets,
>> Stefan
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo March 13, 2017, 7:39 a.m. UTC | #6
At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
> Hi Qu,
>
> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>
>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>> v4.4).
>>
>> Thanks for the test.
>>
>>> V7 results in OOPS to me:
>>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
>>
>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>> nice clue.
>>
>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>>
>> IP points to:
>> ---
>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>> {
>>         struct btrfs_root *root = inode->root; << Either here
>>
>>         if (root == root->fs_info->tree_root && << Or here
>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>
>> ---
>>
>> Taking the above offset into consideration, it's only possible for later
>> case.
>>
>> So here, we have a btrfs_inode whose @root is NULL.
>
> But wasn't this part of the code identical in V5? Why does it only
> happen with V7?

There are still difference, but just as you said, the related 
part(checking if inode is free space cache inode) is identical across v5 
and v7.


I'm afraid that's a rare race leading to NULL btrfs_inode->root, which 
could happen in both v5 and v7.

What's the difference between SUSE and mainline kernel?
Maybe some mainline kernel commits have already fixed it?

Thanks,
Qu
>
>> This can be fixed easily by checking @root inside
>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>> happening for DirectIO, and it won't happen for free space cache inode.
>>
>> But I'm more curious how this happened for a more accurate fix, or we
>> could have other NULL pointer access.
>>
>> Did you have any reproducer for this?
>
> Sorry no - this is a production MariaDB Server running btrfs with
> compress-force=zlib. But if i could test anything i'll do.
>
> Greets,
> Stefan
>
>>
>> Thanks,
>> Qu
>>
>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>> Oops: 0000 [#1] SMP
>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>> i2c_core usb_common ata_piix floppy
>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>> knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>> Call Trace:
>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>> Leftover inexact backtrace:
>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10
>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs]
>>> RSP <ffff8814eae03cd8>
>>> CR2: 00000000000001f0
>>> ---[ end trace 7529a0652fd7873e ]---
>>> Kernel panic - not syncing: Fatal exception in interrupt
>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>
>>> Greets,
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
>>
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Priebe - Profihost AG March 13, 2017, 1:26 p.m. UTC | #7
Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
> 
> 
> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>
>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Qu,
>>>>
>>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>>> v4.4).
>>>
>>> Thanks for the test.
>>>
>>>> V7 results in OOPS to me:
>>>> BUG: unable to handle kernel NULL pointer dereference at
>>>> 00000000000001f0
>>>
>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>>> nice clue.
>>>
>>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>> [btrfs]
>>>
>>> IP points to:
>>> ---
>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>>> {
>>>         struct btrfs_root *root = inode->root; << Either here
>>>
>>>         if (root == root->fs_info->tree_root && << Or here
>>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>>
>>> ---
>>>
>>> Taking the above offset into consideration, it's only possible for later
>>> case.
>>>
>>> So here, we have a btrfs_inode whose @root is NULL.
>>
>> But wasn't this part of the code identical in V5? Why does it only
>> happen with V7?
> 
> There are still difference, but just as you said, the related
> part(checking if inode is free space cache inode) is identical across v5
> and v7.

But if i boot v7 it always happens. If i boot v5 it always works. Have
done 5 repeatet tests.

> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
> could happen in both v5 and v7.
> 
> What's the difference between SUSE and mainline kernel?

A lot ;-) But i don't think anything related.

> Maybe some mainline kernel commits have already fixed it?

May be no idea. But i haven't found any reason why v5 works.

Stefan

> 
> Thanks,
> Qu
>>
>>> This can be fixed easily by checking @root inside
>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>>> happening for DirectIO, and it won't happen for free space cache inode.
>>>
>>> But I'm more curious how this happened for a more accurate fix, or we
>>> could have other NULL pointer access.
>>>
>>> Did you have any reproducer for this?
>>
>> Sorry no - this is a production MariaDB Server running btrfs with
>> compress-force=zlib. But if i could test anything i'll do.
>>
>> Greets,
>> Stefan
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>>> Oops: 0000 [#1] SMP
>>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
>>>> ata_generic
>>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>>> i2c_core usb_common ata_piix floppy
>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>>> knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>>> Call Trace:
>>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>>> Leftover inexact backtrace:
>>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ?
>>>> native_safe_halt+0x6/0x10
>>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>> [btrfs]
>>>> RSP <ffff8814eae03cd8>
>>>> CR2: 00000000000001f0
>>>> ---[ end trace 7529a0652fd7873e ]---
>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>>
>>>> Greets,
>>>> Stefan
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>
>>>
>>
>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo March 14, 2017, 12:30 a.m. UTC | #8
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>
> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>
>>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Qu,
>>>>>
>>>>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>>>>> v4.4).
>>>>
>>>> Thanks for the test.
>>>>
>>>>> V7 results in OOPS to me:
>>>>> BUG: unable to handle kernel NULL pointer dereference at
>>>>> 00000000000001f0
>>>>
>>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>>>> nice clue.
>>>>
>>>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>>> [btrfs]
>>>>
>>>> IP points to:
>>>> ---
>>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>>>> {
>>>>         struct btrfs_root *root = inode->root; << Either here
>>>>
>>>>         if (root == root->fs_info->tree_root && << Or here
>>>>             btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>>>
>>>> ---
>>>>
>>>> Taking the above offset into consideration, it's only possible for later
>>>> case.
>>>>
>>>> So here, we have a btrfs_inode whose @root is NULL.
>>>
>>> But wasn't this part of the code identical in V5? Why does it only
>>> happen with V7?
>>
>> There are still difference, but just as you said, the related
>> part(checking if inode is free space cache inode) is identical across v5
>> and v7.
>
> But if i boot v7 it always happens. If i boot v5 it always works. Have
> done 5 repeatet tests.

That's a problem now, I'll dig it further to find out the difference.

Thanks,
Qu

>
>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>> could happen in both v5 and v7.
>>
>> What's the difference between SUSE and mainline kernel?
>
> A lot ;-) But i don't think anything related.
>
>> Maybe some mainline kernel commits have already fixed it?
>
> May be no idea. But i haven't found any reason why v5 works.
>
> Stefan
>
>>
>> Thanks,
>> Qu
>>>
>>>> This can be fixed easily by checking @root inside
>>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>>>> happening for DirectIO, and it won't happen for free space cache inode.
>>>>
>>>> But I'm more curious how this happened for a more accurate fix, or we
>>>> could have other NULL pointer access.
>>>>
>>>> Did you have any reproducer for this?
>>>
>>> Sorry no - this is a production MariaDB Server running btrfs with
>>> compress-force=zlib. But if i could test anything i'll do.
>>>
>>> Greets,
>>> Stefan
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>>>>> Oops: 0000 [#1] SMP
>>>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>>>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>>>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
>>>>> ata_generic
>>>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>>>>> i2c_core usb_common ata_piix floppy
>>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>> 1.7.5-20140722_172050-sagunt 04/01/2014
>>>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000
>>>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>]
>>>>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>>>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086
>>>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001
>>>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0
>>>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a
>>>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0
>>>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80
>>>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000)
>>>>> knlGS:0000000000000000
>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack:
>>>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0
>>>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38
>>>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001
>>>>> Call Trace:
>>>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs]
>>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60
>>>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330
>>>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70
>>>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>>>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0
>>>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20
>>>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk]
>>>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90
>>>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0
>>>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60
>>>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60
>>>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150
>>>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30
>>>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0
>>>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c
>>>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>>>>> Leftover inexact backtrace:
>>>>> 2017-03-12 20:33:08     <IRQ><EOI>
>>>>> 2017-03-12 20:33:08      [<ffffffffb404ba46>] ?
>>>>> native_safe_halt+0x6/0x10
>>>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0
>>>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20
>>>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40
>>>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370
>>>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80
>>>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d
>>>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120
>>>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c
>>>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a
>>>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>>>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>>>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>>>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140
>>>>> [btrfs]
>>>>> RSP <ffff8814eae03cd8>
>>>>> CR2: 00000000000001f0
>>>>> ---[ end trace 7529a0652fd7873e ]---
>>>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range:
>>>>> 0xffffffff80000000-0xffffffffbfffffff)
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo March 14, 2017, 2:50 a.m. UTC | #9
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>
> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi Qu,
>>>
>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>
>>> But wasn't this part of the code identical in V5? Why does it only
>>> happen with V7?
>>
>> There are still difference, but just as you said, the related
>> part(checking if inode is free space cache inode) is identical across v5
>> and v7.
>
> But if i boot v7 it always happens. If i boot v5 it always works. Have
> done 5 repeatet tests.

I rechecked the code change between v7 and v5.

It turns out that, the code base may cause the problem.

In v7, the base is v4.11-rc1, which introduced quite a lot of 
btrfs_inode cleanup.

One of the difference is the parameter for btrfs_is_free_space_inode().

In v7, the parameter @inode changed from struct inode to struct btrfs_inode.

So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), 
other than plain inode.

That's the most possible cause for me here.

So would you please paste the final patch applied to your tree?
Git diff or git format-patch can both handle it.

Thanks,
Qu

>
>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>> could happen in both v5 and v7.
>>
>> What's the difference between SUSE and mainline kernel?
>
> A lot ;-) But i don't think anything related.
>
>> Maybe some mainline kernel commits have already fixed it?
>
> May be no idea. But i haven't found any reason why v5 works.
>
> Stefan
>
>>
>> Thanks,
>> Qu
>>>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Priebe - Profihost AG March 14, 2017, 9:06 a.m. UTC | #10
Thanks Qu, removing BTRFS_I from the inode fixes this issue to me.

Greets,
Stefan


Am 14.03.2017 um 03:50 schrieb Qu Wenruo:
> 
> 
> At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>>
>> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>>
>>>
>>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Qu,
>>>>
>>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>
>>>> But wasn't this part of the code identical in V5? Why does it only
>>>> happen with V7?
>>>
>>> There are still difference, but just as you said, the related
>>> part(checking if inode is free space cache inode) is identical across v5
>>> and v7.
>>
>> But if i boot v7 it always happens. If i boot v5 it always works. Have
>> done 5 repeatet tests.
> 
> I rechecked the code change between v7 and v5.
> 
> It turns out that, the code base may cause the problem.
> 
> In v7, the base is v4.11-rc1, which introduced quite a lot of
> btrfs_inode cleanup.
> 
> One of the difference is the parameter for btrfs_is_free_space_inode().
> 
> In v7, the parameter @inode changed from struct inode to struct
> btrfs_inode.
> 
> So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(),
> other than plain inode.
> 
> That's the most possible cause for me here.
> 
> So would you please paste the final patch applied to your tree?
> Git diff or git format-patch can both handle it.
> 
> Thanks,
> Qu
> 
>>
>>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>>> could happen in both v5 and v7.
>>>
>>> What's the difference between SUSE and mainline kernel?
>>
>> A lot ;-) But i don't think anything related.
>>
>>> Maybe some mainline kernel commits have already fixed it?
>>
>> May be no idea. But i haven't found any reason why v5 works.
>>
>> Stefan
>>
>>>
>>> Thanks,
>>> Qu
>>>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo March 14, 2017, 9:09 a.m. UTC | #11
At 03/14/2017 05:06 PM, Stefan Priebe - Profihost AG wrote:
> Thanks Qu, removing BTRFS_I from the inode fixes this issue to me.
>
> Greets,
> Stefan
>

Glad to hear that.

And a small tip is, when compiling kernel(at least btrfs module), any 
warning should be checked carefully.

Such type mismatch will cause a warning and can help us to avoid such 
problem.

Thanks,
Qu
>
> Am 14.03.2017 um 03:50 schrieb Qu Wenruo:
>>
>>
>> At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:
>>>
>>> Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
>>>>
>>>>
>>>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Qu,
>>>>>
>>>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>>>
>>>>> But wasn't this part of the code identical in V5? Why does it only
>>>>> happen with V7?
>>>>
>>>> There are still difference, but just as you said, the related
>>>> part(checking if inode is free space cache inode) is identical across v5
>>>> and v7.
>>>
>>> But if i boot v7 it always happens. If i boot v5 it always works. Have
>>> done 5 repeatet tests.
>>
>> I rechecked the code change between v7 and v5.
>>
>> It turns out that, the code base may cause the problem.
>>
>> In v7, the base is v4.11-rc1, which introduced quite a lot of
>> btrfs_inode cleanup.
>>
>> One of the difference is the parameter for btrfs_is_free_space_inode().
>>
>> In v7, the parameter @inode changed from struct inode to struct
>> btrfs_inode.
>>
>> So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(),
>> other than plain inode.
>>
>> That's the most possible cause for me here.
>>
>> So would you please paste the final patch applied to your tree?
>> Git diff or git format-patch can both handle it.
>>
>> Thanks,
>> Qu
>>
>>>
>>>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
>>>> could happen in both v5 and v7.
>>>>
>>>> What's the difference between SUSE and mainline kernel?
>>>
>>> A lot ;-) But i don't think anything related.
>>>
>>>> Maybe some mainline kernel commits have already fixed it?
>>>
>>> May be no idea. But i haven't found any reason why v5 works.
>>>
>>> Stefan
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c40060cc481f..fe588bf30d02 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -998,15 +998,24 @@  static noinline int cow_file_range(struct inode *inode,
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
+			/*
+			 * Only drop cache here, and process as normal.
+			 *
+			 * We must not allow extent_clear_unlock_delalloc()
+			 * at out_unlock label to free meta of this ordered
+			 * extent, as its meta should be freed by
+			 * btrfs_finish_ordered_io().
+			 *
+			 * So we must continue until @start is increased to
+			 * skip current ordered extent.
+			 */
 			if (ret)
-				goto out_drop_extent_cache;
+				btrfs_drop_extent_cache(BTRFS_I(inode), start,
+						start + ram_size - 1, 0);
 		}
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		if (disk_num_bytes < cur_alloc_size)
-			break;
-
 		/* we're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
@@ -1022,10 +1031,21 @@  static noinline int cow_file_range(struct inode *inode,
 					     delalloc_end, locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC,
 					     op);
-		disk_num_bytes -= cur_alloc_size;
+		if (disk_num_bytes < cur_alloc_size)
+			disk_num_bytes = 0;
+		else
+			disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
 		start += cur_alloc_size;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, since start is increased
+		 * extent_clear_unlock_delalloc() at out_unlock label won't
+		 * free metadata of current ordered extent, we're OK to exit.
+		 */
+		if (ret)
+			goto out_unlock;
 	}
 out:
 	return ret;
@@ -1414,15 +1434,14 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 		BUG_ON(ret); /* -ENOMEM */
 
 		if (root->root_key.objectid ==
-		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		    BTRFS_DATA_RELOC_TREE_OBJECTID)
+			/*
+			 * Error handled later, as we must prevent
+			 * extent_clear_unlock_delalloc() in error handler
+			 * from freeing metadata of created ordered extent.
+			 */
 			ret = btrfs_reloc_clone_csums(inode, cur_offset,
 						      num_bytes);
-			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshoting(root);
-				goto error;
-			}
-		}
 
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1, end,
@@ -1434,6 +1453,14 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, now we're OK to call error
+		 * handler, as metadata for created ordered extent will only
+		 * be freed by btrfs_finish_ordered_io().
+		 */
+		if (ret)
+			goto error;
 		if (cur_offset > end)
 			break;
 	}