Ongoing Btrfs stability issues
diff mbox

Message ID 20180302172951.GC30920@dhcp-10-211-47-181.usdhcp.oraclecorp.com
State New
Headers show

Commit Message

Liu Bo March 2, 2018, 5:29 p.m. UTC
On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
> 
> 
> On  1.03.2018 21:04, Alex Adriaanse wrote:
> > On Feb 16, 2018, at 1:44 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
...
> <snip>
> > [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
> > [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
> > [496003.641996] BTRFS info (device xvdc): forced readonly
> > [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
> > [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
> > [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
> > [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
> > [496003.642091] Oops: 0000 [#1] SMP PTI
> > [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
> > [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
> > [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> > [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
> > [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
> 
> 
> if you happen to have the vmlinux of that kernel can you run the
> following from the kernel source directory:
> 
> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>

I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
Alex, do you mind checking if it's included in your kernel?

You can also check if the following change is merged in kernel-src deb.


Thanks,

-liubo

> 
> > [496003.642151] RSP: 0018:ffff9e81c7b8fab0 EFLAGS: 00010286
> > [496003.642153] RAX: 0000000000000000 RBX: ffff8fb90a10a3c0 RCX: ffffca5d1fda5a5f
> > [496003.642154] RDX: 0000000000000001 RSI: ffff8fc05eae62c0 RDI: ffff8fbc4fd87d70
> > [496003.642154] RBP: ffff8fbbb5139000 R08: 0000000000000000 R09: 0000000000000000
> > [496003.642155] R10: ffff8fc05eae62c0 R11: 00000000000001bc R12: ffff8fc0fbeac000
> > [496003.642156] R13: ffff8fbc4fd87d70 R14: ffff8fbc4fd87800 R15: 00000000ffffffe4
> > [496003.642157] FS:  00007f64196708c0(0000) GS:ffff8fc100a40000(0000) knlGS:0000000000000000
> > [496003.642159] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [496003.642160] CR2: 0000000000000000 CR3: 000000069b972004 CR4: 00000000001606e0
> > [496003.642162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [496003.642163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [496003.642164] Call Trace:
> > [496003.642185]  free_reloc_roots+0x22/0x60 [btrfs]
> > [496003.642202]  merge_reloc_roots+0x184/0x260 [btrfs]
> > [496003.642217]  relocate_block_group+0x29a/0x610 [btrfs]
> > [496003.642232]  btrfs_relocate_block_group+0x17b/0x230 [btrfs]
> > [496003.642254]  btrfs_relocate_chunk+0x38/0xb0 [btrfs]
> > [496003.642272]  btrfs_balance+0xa15/0x1250 [btrfs]
> > [496003.642292]  btrfs_ioctl_balance+0x368/0x380 [btrfs]
> > [496003.642309]  btrfs_ioctl+0x1170/0x24e0 [btrfs]
> > [496003.642312]  ? mem_cgroup_try_charge+0x86/0x1a0
> > [496003.642315]  ? __handle_mm_fault+0x640/0x10e0
> > [496003.642318]  ? do_vfs_ioctl+0x9f/0x600
> > [496003.642319]  do_vfs_ioctl+0x9f/0x600
> > [496003.642321]  ? handle_mm_fault+0xc6/0x1b0
> > [496003.642325]  ? __do_page_fault+0x289/0x500
> > [496003.642327]  SyS_ioctl+0x74/0x80
> > [496003.642330]  system_call_fast_compare_end+0xc/0x6f
> > [496003.642332] RIP: 0033:0x7f64186f8e07
> > [496003.642333] RSP: 002b:00007ffcdf69d1b8 EFLAGS: 00000206
> > [496003.642334] Code: 8b a7 f0 01 00 00 4d 8b b4 24 40 14 00 00 4d 8d ae 70 05 00 00 4c 89 ef e8 c2 b9 3e c2 49 8b 9e 68 05 00 00 48 8b 45 00 48 85 db <48> 8b 10 75 0e e9 ad 00 00 00 48 8b 5b 10 48 85 db 74 11 48 3b
> > [496003.642376] RIP: __del_reloc_root+0x3c/0x100 [btrfs] RSP: ffff9e81c7b8fab0
> > [496003.642377] CR2: 0000000000000000
> > [496003.642393] ---[ end trace 6f05416539a50c4e ]---
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alex Adriaanse March 8, 2018, 5:40 p.m. UTC | #1
On Mar 2, 2018, at 11:29 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Thu, Mar 01, 2018 at 09:40:41PM +0200, Nikolay Borisov wrote:
>> On  1.03.2018 21:04, Alex Adriaanse wrote:
>>> Thanks so much for the suggestions so far, everyone. I wanted to report back on this. Last Friday I made the following changes per suggestions from this thread:
>>> 
>>> 1. Change the nightly balance to the following:
>>> 
>>>    btrfs balance start -dusage=20 <fs>
>>>    btrfs balance start -dusage=40,limit=10 <fs>
>>>    btrfs balance start -musage=30 <fs>
>>> 
>>> 2. Upgrade kernels for all VMs to 4.14.13-1~bpo9+1, which contains the SSD space allocation fix.
>>> 
>>> 3. Boot Linux with the elevator=noop option
>>> 
>>> 4. Change /sys/block/xvd*/queue/scheduler to "none"
>>> 
>>> 5. Mount all our Btrfs filesystems with the "enospc_debug" option.
>> 
>> SO that's good, however you didn't apply the out of tree patch (it has
>> already been merged into the for-next so will likely land in 4.17) I
>> pointed you at. As a result when you your ENOSPC error there is no extra
>> information being printed so we can't really reason about what might be
>> going wrong in the metadata flushing algorithms.

Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.

Kernel messages:
[211238.262683] use_block_rsv: 163 callbacks suppressed
[211238.262683] BTRFS: block rsv returned -28
[211238.266718] ------------[ cut here ]------------
[211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
[211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
[211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
[211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
[211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
[211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
[211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
[211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
[211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
[211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
[211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
[211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
[211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211238.404730] Call Trace:
[211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[211238.442899]  ? remove_wait_queue+0x60/0x60
[211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
[211238.450578]  kthread+0xfc/0x130
[211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[211238.458381]  ? kthread_create_on_node+0x70/0x70
[211238.462225]  ? do_group_exit+0x3a/0xa0
[211238.465586]  ret_from_fork+0x1f/0x30
[211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
[211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
[211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
[211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536
[211238.503799] BTRFS: Transaction aborted (error -28)
[211238.503824] ------------[ cut here ]------------
[211238.507828] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:7081 __btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[211238.514371] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
[211238.552170] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
[211238.557481] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[211238.562201] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
[211238.566531] RIP: 0010:__btrfs_free_extent.isra.61+0xaed/0xd30 [btrfs]
[211238.571114] RSP: 0018:ffffb234c3b5bc30 EFLAGS: 00010286
[211238.574984] RAX: 0000000000000026 RBX: 000000ffb55b9000 RCX: 0000000000000006
[211238.579691] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff9cb440a166f0
[211238.584312] RBP: 00000000ffffffe4 R08: 0000000000000001 R09: 0000000000007da8
[211238.589144] R10: 0000000000000001 R11: 0000000000007da8 R12: ffff9cb43bdea000
[211238.594002] R13: ffff9caca472ed90 R14: 0000000000000000 R15: 000000000000a872
[211238.598817] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
[211238.604230] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[211238.608419] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
[211238.613183] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[211238.617894] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[211238.622673] Call Trace:
[211238.625288]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[211238.629399]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[211238.633234]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[211238.637135]  ? remove_wait_queue+0x60/0x60
[211238.640505]  transaction_kthread+0x195/0x1b0 [btrfs]
[211238.644313]  kthread+0xfc/0x130
[211238.647251]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[211238.651407]  ? kthread_create_on_node+0x70/0x70
[211238.654830]  ? do_group_exit+0x3a/0xa0
[211238.657961]  ret_from_fork+0x1f/0x30
[211238.661090] Code: 33 1b 0a 00 e9 39 f9 ff ff 8b 74 24 30 48 c7 c7 c8 07 59 c0 e8 a0 d7 9e d5 0f ff eb cc 89 ee 48 c7 c7 c8 07 59 c0 e8 8e d7 9e d5 <0f> ff e9 f9 f8 ff ff 8b 94 24 c0 00 00 00 48 89 c1 49 89 d8 48
[211238.672818] ---[ end trace 48dd1ab4e2e46f6f ]---
[211238.676465] BTRFS: error (device xvdc) in __btrfs_free_extent:7081: errno=-28 No space left
[211238.676713] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9337: errno=-28 No space left
[211238.676716] BTRFS info (device xvdc): forced readonly
[211238.694217] BTRFS: error (device xvdc) in btrfs_run_delayed_refs:3089: errno=-28 No space left
[211238.700660] BTRFS warning (device xvdc): Skipping commit of aborted transaction.
[211238.706303] BTRFS: error (device xvdc) in cleanup_transaction:1873: errno=-28 No space left
...
[211243.218453] BTRFS error (device xvdc): pending csums is 4448256


btrfs filesystem usage:

Overall:
    Device size:		   1.37TiB
    Device allocated:		 656.04GiB
    Device unallocated:		 743.96GiB
    Device missing:		     0.00B
    Used:			 630.21GiB
    Free (estimated):		 767.05GiB	(min: 767.05GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:646.01GiB, Used:622.92GiB
   /dev/xvdc	 646.01GiB

Metadata,single: Size:10.00GiB, Used:7.29GiB
   /dev/xvdc	  10.00GiB

System,single: Size:32.00MiB, Used:112.00KiB
   /dev/xvdc	  32.00MiB

Unallocated:
   /dev/xvdc	 743.96GiB

>> <snip>
>>> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
>>> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
>>> [496003.641996] BTRFS info (device xvdc): forced readonly
>>> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
>>> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
>>> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
>>> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
>>> [496003.642091] Oops: 0000 [#1] SMP PTI
>>> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
>>> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
>>> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>>> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
>>> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
>> 
>> 
>> if you happen to have the vmlinux of that kernel can you run the
>> following from the kernel source directory:
>> 
>> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>> 

I used extract-vmlinux on the Debian-provided kernel, which is what was running previously, then ran the faddr2line command (with the filename first), but it couldn't find a match. I also did the same thing for the current kernel (for which I have a full vmlinux file available) with the function/offset from today's message, and still, no match.

$ scripts/faddr2line vmlinux btrfs_alloc_tree_block+0x39b/0x4c0
no match for btrfs_alloc_tree_block+0x39b/0x4c0


> I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
> Alex, do you mind checking if it's included in your kernel?
> 
> You can also check if the following change is merged in kernel-src deb.
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 3a49a3c..9841fae 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
>        while (!list_empty(list)) {
>                reloc_root = list_entry(list->next, struct btrfs_root,
>                                        root_list);
> +               __del_reloc_root(reloc_root);
>                free_extent_buffer(reloc_root->node);
>                free_extent_buffer(reloc_root->commit_root);
>                reloc_root->node = NULL;
>                reloc_root->commit_root = NULL;
> -               __del_reloc_root(reloc_root);
>        }
> }
> 


I can confirm that this commit has indeed been applied to both the kernel we were running this morning that error'ed out, as well as the stock Debian linux-image-4.14.0-0.bpo.3-amd64 package that we were running when last week's error happened.


On Mar 1, 2018, at 10:02 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Would you please try to use "btrfs check" to check the filesystem offline?
> I'm wondering if it's extent tree or free space cache get corrupted and
> makes kernel confused about its space allocation.
> 
> 
> I'm not completely sure, but it may also be something wrong with the
> space cache.
> 
> So either mount it with nospace_cache option of use "btrfs check
> --clear-space-cache v1" may help.


Running btrfs check some time after the VM had been rebooted and FS had been remounted (and then unmounted), I didn't see any btrfs check errors. However, running btrfs check today immediately after rebooting the VM, before remounting the FS, yielded the following output:

# btrfs check /dev/xvdc
Checking filesystem on /dev/xvdc
UUID: 2dd35565-641a-40f7-b29d-916c5c7e3ffe
checking extents
parent transid verify failed on 929281753088 wanted 202268 found 202782
parent transid verify failed on 929281753088 wanted 202268 found 202782
Ignoring transid failure
leaf parent key incorrect 929281753088
bad block 929281753088
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
checking csums
checking root refs
Transid errors in file system
found 676316246080 bytes used err is 1
total csum bytes: 652799620
total tree bytes: 7539572736
total fs tree bytes: 3933503488
total extent tree bytes: 2580004864
btree space waste bytes: 1502581682
file data blocks allocated: 4644964134912
 referenced 1503039082496


Alex--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov March 9, 2018, 9:54 a.m. UTC | #2
<snip>
> Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.
> 
> Kernel messages:
> [211238.262683] use_block_rsv: 163 callbacks suppressed
> [211238.262683] BTRFS: block rsv returned -28
> [211238.266718] ------------[ cut here ]------------
> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> [211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
> [211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
> [211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
> [211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
> [211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
> [211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
> [211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
> [211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
> [211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
> [211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [211238.404730] Call Trace:
> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [211238.442899]  ? remove_wait_queue+0x60/0x60
> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
> [211238.450578]  kthread+0xfc/0x130
> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [211238.458381]  ? kthread_create_on_node+0x70/0x70
> [211238.462225]  ? do_group_exit+0x3a/0xa0
> [211238.465586]  ret_from_fork+0x1f/0x30
> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
> [211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536

Ok so the numbers here are helpful, they show that we have enough space
to allocate a chunk. I've also looked at the logic in 4.14.13 and all
the necessary patches are there. Unfortunately none of this matters due
to the fact that reserve_metadata_bytes is being called with
BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
any effort to flush anything at all.

Can you tell again what the workload is - is it some sort of a database,
constantly writing to its files? If so, btrfs is not really suited for
rewrite-heavy workloads since this causes excessive CoW. In such cases
you really ought to set nodatacow on the specified files. For more
information:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

The other thing that comes to mind is to try and tune the default commit
interval. Currently this is 30 seconds, meaning a transaction will
happen roughly every 30 seconds (unless there is enough data batched
that it should happen "NOW", where "NOW" is defined as "during the life
cycle of some arbitrary operation"). Perhaps in your case you could
potentially lower it to, say 15s. This will likely have performance
impact but should reduce the premature ENOSPC since there should be a
lot less data in the transaction.

In the meantime I will investigate what can be done in case we get
ENOSPC from NO_FLUSH contexts, my feeling is that we should be able to
force allocate a new chunk. But this will take some time to verify.

<snip>

> 
> 
> btrfs filesystem usage:
> 
> Overall:
>     Device size:		   1.37TiB
>     Device allocated:		 656.04GiB
>     Device unallocated:		 743.96GiB
>     Device missing:		     0.00B
>     Used:			 630.21GiB
>     Free (estimated):		 767.05GiB	(min: 767.05GiB)
>     Data ratio:			      1.00
>     Metadata ratio:		      1.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:646.01GiB, Used:622.92GiB
>    /dev/xvdc	 646.01GiB
> 
> Metadata,single: Size:10.00GiB, Used:7.29GiB
>    /dev/xvdc	  10.00GiB
> 
> System,single: Size:32.00MiB, Used:112.00KiB
>    /dev/xvdc	  32.00MiB
> 
> Unallocated:
>    /dev/xvdc	 743.96GiB
> 
>>> <snip>
>>>> [496003.641729] BTRFS: error (device xvdc) in __btrfs_free_extent:7076: errno=-28 No space left
>>>> [496003.641994] BTRFS: error (device xvdc) in btrfs_drop_snapshot:9332: errno=-28 No space left
>>>> [496003.641996] BTRFS info (device xvdc): forced readonly
>>>> [496003.641998] BTRFS: error (device xvdc) in merge_reloc_roots:2470: errno=-28 No space left
>>>> [496003.642060] BUG: unable to handle kernel NULL pointer dereference at           (null)
>>>> [496003.642086] IP: __del_reloc_root+0x3c/0x100 [btrfs]
>>>> [496003.642087] PGD 80000005fe08c067 P4D 80000005fe08c067 PUD 3bd2f4067 PMD 0
>>>> [496003.642091] Oops: 0000 [#1] SMP PTI
>>>> [496003.642093] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev intel_rapl_perf serio_raw parport_pc parport evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm aesni_intel aes_x86_64 crypto_simd drm_kms_helper cryptd glue_helper ena psmouse drm scsi_mod i2c_piix4 button
>>>> [496003.642128] CPU: 1 PID: 25327 Comm: btrfs Tainted: G        W       4.14.0-0.bpo.3-amd64 #1 Debian 4.14.13-1~bpo9+1
>>>> [496003.642129] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>>>> [496003.642130] task: ffff8fbffb8dd080 task.stack: ffff9e81c7b8c000
>>>> [496003.642149] RIP: 0010:__del_reloc_root+0x3c/0x100 [btrfs]
>>>
>>>
>>> if you happen to have the vmlinux of that kernel can you run the
>>> following from the kernel source directory:
>>>
>>> ./scripts/faddr2line  __del_reloc_root+0x3c/0x100 vmlinux
>>>
> 
> I used extract-vmlinux on the Debian-provided kernel, which is what was running previously, then ran the faddr2line command (with the filename first), but it couldn't find a match. I also did the same thing for the current kernel (for which I have a full vmlinux file available) with the function/offset from today's message, and still, no match.
> 
> $ scripts/faddr2line vmlinux btrfs_alloc_tree_block+0x39b/0x4c0
> no match for btrfs_alloc_tree_block+0x39b/0x4c0
> 
> 
>> I thought this was fixed by bb166d7 btrfs: fix NULL pointer dereference from free_reloc_roots(),
>> Alex, do you mind checking if it's included in your kernel?
>>
>> You can also check if the following change is merged in kernel-src deb.
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 3a49a3c..9841fae 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -2400,11 +2400,11 @@ void free_reloc_roots(struct list_head *list)
>>        while (!list_empty(list)) {
>>                reloc_root = list_entry(list->next, struct btrfs_root,
>>                                        root_list);
>> +               __del_reloc_root(reloc_root);
>>                free_extent_buffer(reloc_root->node);
>>                free_extent_buffer(reloc_root->commit_root);
>>                reloc_root->node = NULL;
>>                reloc_root->commit_root = NULL;
>> -               __del_reloc_root(reloc_root);
>>        }
>> }
>>
> 
> 
> I can confirm that this commit has indeed been applied to both the kernel we were running this morning that error'ed out, as well as the stock Debian linux-image-4.14.0-0.bpo.3-amd64 package that we were running when last week's error happened.
> 
> 
> On Mar 1, 2018, at 10:02 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Would you please try to use "btrfs check" to check the filesystem offline?
>> I'm wondering if it's extent tree or free space cache get corrupted and
>> makes kernel confused about its space allocation.
>>
>>
>> I'm not completely sure, but it may also be something wrong with the
>> space cache.
>>
>> So either mount it with nospace_cache option of use "btrfs check
>> --clear-space-cache v1" may help.
> 
> 
> Running btrfs check some time after the VM had been rebooted and FS had been remounted (and then unmounted), I didn't see any btrfs check errors. However, running btrfs check today immediately after rebooting the VM, before remounting the FS, yielded the following output:
> 
> # btrfs check /dev/xvdc
> Checking filesystem on /dev/xvdc
> UUID: 2dd35565-641a-40f7-b29d-916c5c7e3ffe
> checking extents
> parent transid verify failed on 929281753088 wanted 202268 found 202782
> parent transid verify failed on 929281753088 wanted 202268 found 202782
> Ignoring transid failure
> leaf parent key incorrect 929281753088
> bad block 929281753088
> Errors found in extent allocation tree or chunk allocation
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> Transid errors in file system
> found 676316246080 bytes used err is 1
> total csum bytes: 652799620
> total tree bytes: 7539572736
> total fs tree bytes: 3933503488
> total extent tree bytes: 2580004864
> btree space waste bytes: 1502581682
> file data blocks allocated: 4644964134912
>  referenced 1503039082496
> 
> 
> Alex
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Adriaanse March 9, 2018, 7:05 p.m. UTC | #3
On Mar 9, 2018, at 3:54 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> <snip>
>> Sorry, I clearly missed that one. I have applied the patch you referenced and rebooted the VM in question. This morning we had another FS failure on the same machine that caused it to go into readonly mode. This happened after that device was experiencing 100% I/O utilization for some time. No balance was running at the time; last balance finished about 6 hours prior to the error.
>> 
>> Kernel messages:
>> [211238.262683] use_block_rsv: 163 callbacks suppressed
>> [211238.262683] BTRFS: block rsv returned -28
>> [211238.266718] ------------[ cut here ]------------
>> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse drm ena scsi_mod i2c_piix4 button
>> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
>> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>> [211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
>> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
>> [211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
>> [211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
>> [211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
>> [211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
>> [211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
>> [211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
>> [211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
>> [211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [211238.404730] Call Trace:
>> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
>> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
>> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
>> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
>> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
>> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
>> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
>> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
>> [211238.442899]  ? remove_wait_queue+0x60/0x60
>> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
>> [211238.450578]  kthread+0xfc/0x130
>> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
>> [211238.458381]  ? kthread_create_on_node+0x70/0x70
>> [211238.462225]  ? do_group_exit+0x3a/0xa0
>> [211238.465586]  ret_from_fork+0x1f/0x30
>> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
>> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
>> [211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
>> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536
> 
> Ok so the numbers here are helpful, they show that we have enough space
> to allocate a chunk. I've also looked at the logic in 4.14.13 and all
> the necessary patches are there. Unfortunately none of this matters due
> to the fact that reserve_metadata_bytes is being called with
> BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
> any effort to flush anything at all.
> 
> Can you tell again what the workload is - is it some sort of a database,
> constantly writing to its files?

Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load on these machines. We also have snapshots being deleted and created every 15 minutes. Looking at historical atop data for the two most recent crashes:

1. Right before last week's crash, looking at the last three minutes before crash, Postgres databases were averaging a total of 45 MB/s in reads and 29 MB/s in writes. btrfs balance was also running at that time, which was reading at 1 MB/s and writing at 30 MB/s. btrfs-transacti kernel processes were doing only a small amount disk I/O, until a minute before the crash, at which time they ramped up to 0.4 MB/s in reads and 6 MB/s in writes. For cumulative I/O on the entire VM In the three minutes prior to the crash we saw an average of 95% I/O utilization, 2,200 read ops/sec, 47 MB/s read, 2,700 write ops/sec, 77 MB/s written.

2. Right before yesterday's crash (which had no balance running), Postgres databases were averaging a total of 81 MB/s in reads and 14 MB/s in writes. btrfs-cleaner and btrfs-transacti kernel processes weren't doing much I/O until the minute before the crash, at which point btrfs-cleaner was reading at 2 MB/s and writing at 16 MB/s, and btrfs-transacti was reading at 1 MB/s and writing at 5 MB/s. For cumulative I/O on the entire VM In the three minutes prior to the crash we saw an average of 103% I/O utilization, 2,600 read ops/sec, 86 MB/s read, 1,300 write ops/sec, 35 MB/s written.

I just realized both crashes happened within 5 minutes of snapshot creation + deletion. Not sure if that's just a coincidence.


> If so, btrfs is not really suited for
> rewrite-heavy workloads since this causes excessive CoW. In such cases
> you really ought to set nodatacow on the specified files. For more
> information:
> 
> https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

Am I correct to understand that nodatacow doesn't really avoid CoW when you're using snapshots? In a filesystem that's snapshotted every 15 minutes, is there a difference between normal CoW and nodatacow when (in the case of Postgres) you update a small portion of a 1GB file many times per minute? Do you anticipate us seeing a benefit in stability and performance if we set nodatacow for the entire FS while retaining snapshots? Does nodatacow increase the chance of corruption in a database like Postgres, i.e. are writes still properly ordered/sync'ed when flushed to disk?


> The other thing that comes to mind is to try and tune the default commit
> interval. Currently this is 30 seconds, meaning a transaction will
> happen roughly every 30 seconds (unless there is enough data batched
> that it should happen "NOW", where "NOW" is defined as "during the life
> cycle of some arbitrary operation"). Perhaps in your case you could
> potentially lower it to, say 15s. This will likely have performance
> impact but should reduce the premature ENOSPC since there should be a
> lot less data in the transaction.
> 
> In the meantime I will investigate what can be done in case we get
> ENOSPC from NO_FLUSH contexts, my feeling is that we should be able to
> force allocate a new chunk. But this will take some time to verify.

Thanks for your help!

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov March 10, 2018, 12:04 p.m. UTC | #4
On  9.03.2018 21:05, Alex Adriaanse wrote:
> Am I correct to understand that nodatacow doesn't really avoid CoW when you're using snapshots? In a filesystem that's snapshotted 

Yes, so nodatacow won't interfere with how snapshots operate. For more
information on that topic check the following mailing list thread:
https://www.spinics.net/lists/linux-btrfs/msg62715.html

every 15 minutes, is there a difference between normal CoW and nodatacow
when (in the case of Postgres) you update a small portion of a 1GB file
many times per minute? Do you anticipate us seeing a benefit in
stability and performance if we set nodatacow for the
So regarding this, you can check :
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Essentially every bit of small, random postgres update in the db file
will cause a CoW operation + checksum IO which cause, and I quote, "
thrashing on HDDs and excessive multi-second spikes of CPU load on
systems with an SSD or large amount a RAM."

So for OLTP workloads you definitely want nodatacow enabled, bear in
mind this also disables crc checksumming, but your db engine should
already have such functionality implemented in it.

entire FS while retaining snapshots? Does nodatacow increase the chance
of corruption in a database like Postgres, i.e. are writes still
properly ordered/sync'ed when flushed to disk?

Well most modern DB already implement some sort of a WAL, so the
reliability responsibility is shifted on the db engine.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer March 10, 2018, 2:29 p.m. UTC | #5
On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
> So for OLTP workloads you definitely want nodatacow enabled, bear in
> mind this also disables crc checksumming, but your db engine should
> already have such functionality implemented in it.

Unlike repeated claims made here on the list and other places... I
woudln't now *any* DB system which actually does this per default and
or in a way that would be comparable to filesystem lvl checksumming.


Look back in the archives... when I've asked several times for
checksumming support *with* nodatacow, I evaluated the existing status
for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
this either not enabled per default, not at all, or requiring special
support for the program using the DB.


Similar btw: no single VM image type I've evaluated back then had any
form of checksumming integrated.


Still, one of the major deficiencies (not in comparison to other fs,
but in comparison to how it should be) of btrfs unfortunately :-(


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli March 11, 2018, 5:51 p.m. UTC | #6
On 03/10/2018 03:29 PM, Christoph Anton Mitterer wrote:
> On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
>> So for OLTP workloads you definitely want nodatacow enabled, bear in
>> mind this also disables crc checksumming, but your db engine should
>> already have such functionality implemented in it.
> 
> Unlike repeated claims made here on the list and other places... I
> woudln't now *any* DB system which actually does this per default and
> or in a way that would be comparable to filesystem lvl checksumming.
> 

I agree with you, also nobody warn that without checksum in case of RAID filesystem BTRFS is not capable anymore to check if a stripe is correct or not

> 
> Look back in the archives... when I've asked several times for
> checksumming support *with* nodatacow, I evaluated the existing status
> for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
> this either not enabled per default, not at all, or requiring special
> support for the program using the DB.
> 
> 
> Similar btw: no single VM image type I've evaluated back then had any
> form of checksumming integrated.
> 
> 
> Still, one of the major deficiencies (not in comparison to other fs,
> but in comparison to how it should be) of btrfs unfortunately :-(

COW is needed to properly checksum the data. Otherwise is not possible to ensure the coherency between data and checksum (however I have to point out that BTRFS fails even in this case [*]).
We could rearrange this sentence, saying that: if you want checksum, you need COW...

> 
> 
> Cheers,
> Chris.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

[*] https://www.spinics.net/lists/linux-btrfs/msg69185.html
Christoph Anton Mitterer March 11, 2018, 10:37 p.m. UTC | #7
On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
> 
> COW is needed to properly checksum the data. Otherwise is not
> possible to ensure the coherency between data and checksum (however I
> have to point out that BTRFS fails even in this case [*]).
> We could rearrange this sentence, saying that: if you want checksum,
> you need COW...

No,... not really... the meta-data is anyway always CoWed... so if you
do checksum *and* notdatacow,..., the only thing that could possibly
happen (in the worst case) is, that data that actually made it
correctly to the disk is falsely determined bad, as the metadata (i.e.
the checksums) weren't upgraded correctly.

That however is probably much less likely than the other way round,..
i.e. bad data went to disk and would be detected with checksuming.


I had lots of discussions about this here on the list, and no one ever
brought up a real argument against it... I also had an off-list
discussion with Chris Mason who IIRC confirmed that it would actually
work as I imagine it... with the only two problems:
- good data possibly be marked bad because of bad checksums
- reads giving back EIO where people would rather prefer bad data
(not really sure if this were really his two arguments,... I'd have to
look it up, so don't nail me down).


Long story short:

In any case, I think giving back bad data without EIO is unacceptable.
If someone really doesn't care (e.g. because he has higher level
checksumming and possibly even repair) he could still manually disable
checksumming.

The little chance of having a false positive weights IMO far less that
have very large amounts of data (DBs, VM images are our typical cases)
completely unprotected.

And not having checksumming with notdatacow breaks any safe raid repair
(so in that case "repair" may even overwrite good data),... which is
IMO also unacceptable.
And the typical use cases for nodatacow (VMs, DBs) are in turn not so
uncommon to want RAID.


I really like btrfs,... and it's not that other fs (which typically
have no checksumming at all) would perform better here... but not
having it for these major use case is a big disappointment for me.


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli March 12, 2018, 9:22 p.m. UTC | #8
On 03/11/2018 11:37 PM, Christoph Anton Mitterer wrote:
> On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
>>
>> COW is needed to properly checksum the data. Otherwise is not
>> possible to ensure the coherency between data and checksum (however I
>> have to point out that BTRFS fails even in this case [*]).
>> We could rearrange this sentence, saying that: if you want checksum,
>> you need COW...
> 
> No,... not really... the meta-data is anyway always CoWed... so if you
> do checksum *and* notdatacow,..., the only thing that could possibly
> happen (in the worst case) is, that data that actually made it
> correctly to the disk is falsely determined bad, as the metadata (i.e.
> the checksums) weren't upgraded correctly.
> 
> That however is probably much less likely than the other way round,..
> i.e. bad data went to disk and would be detected with checksuming.

Unfortunately no, the likelihood might be 100%: there are some patterns which trigger this problem quite easily. See The link which I posted in my previous email. There was a program which creates a bad checksum (in COW+DATASUM mode), and the file became unreadable.

> 
> 
> I had lots of discussions about this here on the list, and no one ever
> brought up a real argument against it... I also had an off-list
> discussion with Chris Mason who IIRC confirmed that it would actually
> work as I imagine it... with the only two problems:
> - good data possibly be marked bad because of bad checksums
> - reads giving back EIO where people would rather prefer bad data

If you cannot know if a checksum is bad or the data is bad, the checksum is not useful at all!

If I read correctly what you wrote, it seems that you consider a "minor issue" the fact that the checksum is not correct. If you accept the possibility that a checksum might be wrong, you wont trust anymore the checksum; so the checksum became not useful.
 

> (not really sure if this were really his two arguments,... I'd have to
> look it up, so don't nail me down).
> 
> 
> Long story short:
> 
> In any case, I think giving back bad data without EIO is unacceptable.
> If someone really doesn't care (e.g. because he has higher level
> checksumming and possibly even repair) he could still manually disable
> checksumming.
> 
> The little chance of having a false positive weights IMO far less that
> have very large amounts of data (DBs, VM images are our typical cases)
> completely unprotected.

Again, you are assuming that the likelihood of having a bad checksum is low. Unfortunately this is not true. There are pattern which exploits this bug with a likelihood=100%.

> 
> And not having checksumming with notdatacow breaks any safe raid repair
> (so in that case "repair" may even overwrite good data),... which is
> IMO also unacceptable.
> And the typical use cases for nodatacow (VMs, DBs) are in turn not so
> uncommon to want RAID.
> 
> 
> I really like btrfs,... and it's not that other fs (which typically
> have no checksumming at all) would perform better here... but not
> having it for these major use case is a big disappointment for me.
> 
> 
> Cheers,
> Chris.
>
Christoph Anton Mitterer March 12, 2018, 9:48 p.m. UTC | #9
On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
> Unfortunately no, the likelihood might be 100%: there are some
> patterns which trigger this problem quite easily. See The link which
> I posted in my previous email. There was a program which creates a
> bad checksum (in COW+DATASUM mode), and the file became unreadable.
But that rather seems like a plain bug?!

No reason that would conceptually make checksumming+notdatacow
impossible.

AFAIU, the conceptual thin would be about:
- data is written in nodatacow
  => thus a checksum must be written as well, so write it
- what can then of course happen is
  - both csum and data are written => fine
  - csum is written but data not and then some crash => csum will show
    that => fine
  - data is written but csum not and then some crash => csum will give
    false positive

Still better few false positives, as many unnoticed data corruptions
and no true raid repair.


> If you cannot know if a checksum is bad or the data is bad, the
> checksum is not useful at all!
Why not? It's anyway only uncertain in the case of crash,... and it at
least tells you that something is fishy.
A program which cares about its data will ensure its own journaling
means and can simply recover by this... or users could then just roll
in a backup.
Or one could provide some API/userland tool to recompute the csums of
the affected file (and possibly live with bad data).


> If I read correctly what you wrote, it seems that you consider a
> "minor issue" the fact that the checksum is not correct. If you
> accept the possibility that a checksum might be wrong, you wont trust
> anymore the checksum; so the checksum became not useful.
There's simply no disadvantage to not having checksumming at all in the
nodatacow case.
Cause then you never have an idea whether your data is correct or
not... the case with checksumming + datacow, which can give a false
positive on a crash when data was written correctly, but not the
checksum, covers at least the other cases of data corruption (silent
data corruption, csum written, but data not or only partially in case
of a crash).


> Again, you are assuming that the likelihood of having a bad checksum
> is low. Unfortunately this is not true. There are pattern which
> exploits this bug with a likelihood=100%.

Okay I don't understand why this would be so and wouldn't assume that
the IO pattern can affect it heavily... but I'm not really btrfs
expert.

My blind assumption would have been that writing an extent of data
takes much longer to complete than writing the corresponding checksum.

Even if not... I should be only a problem in case of a crash during
that,.. and than I'd still prefer to get the false positive than bad
data.


Anyway... it's not going to happen so the discussion is pointless.
I think people can probably use dm-integrity (which btw: does no CoW
either (IIRC) and still can provide integrity... ;-) ) to see whether
their data is valid.
No nice but since it won't change on btrfs, a possible alternative.


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrik Lundquist March 13, 2018, 1:47 p.m. UTC | #10
On 9 March 2018 at 20:05, Alex Adriaanse <alex@oseberg.io> wrote:
>
> Yes, we have PostgreSQL databases running these VMs that put a heavy I/O load on these machines.

Dump the databases and recreate them with --data-checksums and Btrfs
No_COW attribute.

You can add this to /etc/postgresql-common/createcluster.conf in
Debian/Ubuntu if you use pg_createcluster:
initdb_options = '--data-checksums'
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli March 13, 2018, 7:36 p.m. UTC | #11
On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>> Unfortunately no, the likelihood might be 100%: there are some
>> patterns which trigger this problem quite easily. See The link which
>> I posted in my previous email. There was a program which creates a
>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
> But that rather seems like a plain bug?!

You are right, unfortunately it seems that it is catalogate as WONT-FIX :(

> No reason that would conceptually make checksumming+notdatacow
> impossible.
> 
> AFAIU, the conceptual thin would be about:
> - data is written in nodatacow
>   => thus a checksum must be written as well, so write it
> - what can then of course happen is
>   - both csum and data are written => fine
>   - csum is written but data not and then some crash => csum will show
>     that => fine
>   - data is written but csum not and then some crash => csum will give
>     false positive
> 
> Still better few false positives, as many unnoticed data corruptions
> and no true raid repair.

A checksum mismatch, is returned as -EIO by a read() syscall. This is an event handled badly by most part of the programs. 
I.e. suppose that a page of a VM ram image file has a wrong checksum. When the VM starts, tries to read the page, got -EIO and aborts. It is even possible that it could not print which page is corrupted. In this case, how the user understand the problem, and what he could do ?


[....]

> 
>> Again, you are assuming that the likelihood of having a bad checksum
>> is low. Unfortunately this is not true. There are pattern which
>> exploits this bug with a likelihood=100%.
> 
> Okay I don't understand why this would be so and wouldn't assume that
> the IO pattern can affect it heavily... but I'm not really btrfs
> expert.
> 
> My blind assumption would have been that writing an extent of data
> takes much longer to complete than writing the corresponding checksum.

The problem is the following: there is a time window between the checksum computation and the writing the data on the disk (which is done at the lower level via a DMA channel), where if the data is update the checksum would mismatch. This happens if we have two threads, where the first commits the data on the disk, and the second one updates the data (I think that both VM and database could behave so).

In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist. 

I am curious how ZFS solves this problem.

However I have to point out that this problem is not solved by the COW. COW solved only the problem about an interrupted commit of the filesystem, where the data is update in place (so it is available by the user), but the metadata not.


> 
> Even if not... I should be only a problem in case of a crash during
> that,.. and than I'd still prefer to get the false positive than bad
> data.

How you can know if it is a "bad data" or a "bad checksum" ?


> 
> 
> Anyway... it's not going to happen so the discussion is pointless.
> I think people can probably use dm-integrity (which btw: does no CoW
> either (IIRC) and still can provide integrity... ;-) ) to see whether
> their data is valid.
> No nice but since it won't change on btrfs, a possible alternative.

Even in this case I am curious about dm-integrity would sole this issue.

> 
> 
> Cheers,
> Chris.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Christoph Anton Mitterer March 13, 2018, 8:10 p.m. UTC | #12
On Tue, 2018-03-13 at 20:36 +0100, Goffredo Baroncelli wrote:
> A checksum mismatch, is returned as -EIO by a read() syscall. This is
> an event handled badly by most part of the programs.
Then these programs must simply be fixed... otherwise they'll also fail
under normal circumstances with btrfs, if there is any corruption.


> The problem is the following: there is a time window between the
> checksum computation and the writing the data on the disk (which is
> done at the lower level via a DMA channel), where if the data is
> update the checksum would mismatch. This happens if we have two
> threads, where the first commits the data on the disk, and the second
> one updates the data (I think that both VM and database could behave
> so).
Well that's clear... but isn't that time frame also there if the extent
is just written without CoW (regardless of checksumming)?
Obviously there would need to be some protection here anyway, so that
such data is taken e.g. from RAM, before the write has completed, so
that the read wouldn't take place while the write has only half
finished?!
So I'd naively assume one could just enlarge that protection to the
completion of checksum writing,...



> In btrfs, a checksum mismatch creates an -EIO error during the
> reading. In a conventional filesystem (or a btrfs filesystem w/o
> datasum) there is no checksum, so this problem doesn't exist.
If ext writes an extent (can't that be up to 128MiB there?), then I'm
sure it cannot write that atomically (in terms of hardware)... so there
is likely some protection around this operation, that there are no
concurrent reads of that particular extent from the disk, while the
write hasn't finished yet.



> > Even if not... I should be only a problem in case of a crash during
> > that,.. and than I'd still prefer to get the false positive than
> > bad
> > data.
> 
> How you can know if it is a "bad data" or a "bad checksum" ?
Well as I've said, in my naive thinking this should only be a problem
in case of a crash... and then, yes, one cannot say whether it's bad
data or checksum (that's exactly what I'm saying)... but I rather
prefer to know that something might be fishy, then not knowing anything
and perhaps even get good data "RAID-repaired" with bad one...


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Austin S. Hemmelgarn March 14, 2018, 12:02 p.m. UTC | #13
On 2018-03-13 15:36, Goffredo Baroncelli wrote:
> On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
>> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>>> Unfortunately no, the likelihood might be 100%: there are some
>>> patterns which trigger this problem quite easily. See The link which
>>> I posted in my previous email. There was a program which creates a
>>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
>> But that rather seems like a plain bug?!
> 
> You are right, unfortunately it seems that it is catalogate as WONT-FIX :(
> 
>> No reason that would conceptually make checksumming+notdatacow
>> impossible.
>>
>> AFAIU, the conceptual thin would be about:
>> - data is written in nodatacow
>>    => thus a checksum must be written as well, so write it
>> - what can then of course happen is
>>    - both csum and data are written => fine
>>    - csum is written but data not and then some crash => csum will show
>>      that => fine
>>    - data is written but csum not and then some crash => csum will give
>>      false positive
>>
>> Still better few false positives, as many unnoticed data corruptions
>> and no true raid repair.
> 
> A checksum mismatch, is returned as -EIO by a read() syscall. This is an event handled badly by most part of the programs.
> I.e. suppose that a page of a VM ram image file has a wrong checksum. When the VM starts, tries to read the page, got -EIO and aborts. It is even possible that it could not print which page is corrupted. In this case, how the user understand the problem, and what he could do ?
Check the kernel log on the host system, which should have an error 
message saying which block failed.  If the VM itself actually gets to 
the point of booting into an OS (and properly propagates things like 
-EIO to the guest environment like it should), that OS should also log 
where the error was.

Most of the reason user applications don't tell you where the error was 
is because the kernel already does it on any sensible system, and the 
kernel tells you _exactly_ where the error was (exact block and device 
that threw the error), which user applications can't really do (they 
generally can't get sufficiently low-level information to give you all 
the info the kernel does).
> 
>>
>>> Again, you are assuming that the likelihood of having a bad checksum
>>> is low. Unfortunately this is not true. There are pattern which
>>> exploits this bug with a likelihood=100%.
>>
>> Okay I don't understand why this would be so and wouldn't assume that
>> the IO pattern can affect it heavily... but I'm not really btrfs
>> expert.
>>
>> My blind assumption would have been that writing an extent of data
>> takes much longer to complete than writing the corresponding checksum.
> 
> The problem is the following: there is a time window between the checksum computation and the writing the data on the disk (which is done at the lower level via a DMA channel), where if the data is update the checksum would mismatch. This happens if we have two threads, where the first commits the data on the disk, and the second one updates the data (I think that both VM and database could behave so).
Though it only matters if you use O_DIRECT or the files in question are 
NOCOW.
> 
> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
> 
> I am curious how ZFS solves this problem.
It doesn't support disabling COW or the O_DIRECT flag, so it just never 
has the problem in the first place.
> 
> However I have to point out that this problem is not solved by the COW. COW solved only the problem about an interrupted commit of the filesystem, where the data is update in place (so it is available by the user), but the metadata not.
COW is irrelevant if you're bypassing it.  It's only enforced for 
metadata so that you don't have to check the FS every time you mount it 
(because the way BTRFS uses it guarantees consistency of the metadata).
> 
>>
>> Even if not... I should be only a problem in case of a crash during
>> that,.. and than I'd still prefer to get the false positive than bad
>> data.
> 
> How you can know if it is a "bad data" or a "bad checksum" ?
You can't directly.  Just like you can't know which copy in a two-device 
MD RAID1 array is bad when they mismatch.

That's part of why I'm not all that fond of the idea of having checksums 
without COW, you need to verify the data using secondary means anyway, 
so why exactly should you waste time verifying it twice?
> 
>>
>> Anyway... it's not going to happen so the discussion is pointless.
>> I think people can probably use dm-integrity (which btw: does no CoW
>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>> their data is valid.
>> No nice but since it won't change on btrfs, a possible alternative.
> 
> Even in this case I am curious about dm-integrity would sole this issue.
dm-integrity uses journaling, and actually based on the testing I've 
done, will typically have much worse performance than the overhead of 
just enabling COW on files on BTRFS and manually defragmenting them on a 
regular basis.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli March 14, 2018, 6:39 p.m. UTC | #14
On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]
>>
>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>
>> I am curious how ZFS solves this problem.
> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.

I would like to perform some tests: however I think that you are right. if you make a "double buffering" approach (copy the data in the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be sufficient; what about mmap ? Are we sure that this does a double buffering ?

I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer this to the checksum mismatch bug.


>>
>> However I have to point out that this problem is not solved by the COW. COW solved only the problem about an interrupted commit of the filesystem, where the data is update in place (so it is available by the user), but the metadata not.
> COW is irrelevant if you're bypassing it.  It's only enforced for metadata so that you don't have to check the FS every time you mount it (because the way BTRFS uses it guarantees consistency of the metadata).
>>
>>>
>>> Even if not... I should be only a problem in case of a crash during
>>> that,.. and than I'd still prefer to get the false positive than bad
>>> data.
>>
>> How you can know if it is a "bad data" or a "bad checksum" ?
> You can't directly.  Just like you can't know which copy in a two-device MD RAID1 array is bad when they mismatch.
> 
> That's part of why I'm not all that fond of the idea of having checksums without COW, you need to verify the data using secondary means anyway, so why exactly should you waste time verifying it twice?

This is true

>>
>>>
>>> Anyway... it's not going to happen so the discussion is pointless.
>>> I think people can probably use dm-integrity (which btw: does no CoW
>>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>>> their data is valid.
>>> No nice but since it won't change on btrfs, a possible alternative.
>>
>> Even in this case I am curious about dm-integrity would sole this issue.
> dm-integrity uses journaling, and actually based on the testing I've done, will typically have much worse performance than the overhead of just enabling COW on files on BTRFS and manually defragmenting them on a regular basis.

Good to know
>
Austin S. Hemmelgarn March 14, 2018, 7:27 p.m. UTC | #15
On 2018-03-14 14:39, Goffredo Baroncelli wrote:
> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
> [...]
>>>
>>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>>
>>> I am curious how ZFS solves this problem.
>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.
> 
> I would like to perform some tests: however I think that you are right. if you make a "double buffering" approach (copy the data in the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be sufficient; what about mmap ? Are we sure that this does a double buffering ?
There's a whole lot of applications that would be showing some pretty 
serious issues if checksumming didn't work correctly with mmap(), so I 
think it does work correctly given that we don't have hordes of angry 
users and sysadmins beating down the doors.
> 
> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer this to the checksum mismatch bug.
This is only reasonable if you are writing to the files.  Checksums 
appear to be checked on O_DIRECT reads, and outside of databases and 
VM's, read-only access accounts for a significant percentage of O_DIRECT 
usage, partly because it is needed for AIO support (nginx for example 
can serve files using AIO and O_DIRECT and gets a pretty serious 
performance boost on heavily loaded systems by doing so).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli March 14, 2018, 10:17 p.m. UTC | #16
On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-14 14:39, Goffredo Baroncelli wrote:
>> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
>> [...]
>>>>
>>>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a conventional filesystem (or a btrfs filesystem w/o datasum) there is no checksum, so this problem doesn't exist.
>>>>
>>>> I am curious how ZFS solves this problem.
>>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has the problem in the first place.
>>
>> I would like to perform some tests: however I think that you are right. if you make a "double buffering" approach (copy the data in the page cache, compute the checksum, then write the data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be sufficient; what about mmap ? Are we sure that this does a double buffering ?
> There's a whole lot of applications that would be showing some pretty serious issues if checksumming didn't work correctly with mmap(), so I think it does work correctly given that we don't have hordes of angry users and sysadmins beating down the doors.

I tried to do in parallel updating a page and writing in different thread; I was unable to reproduce a checksum mismatch; so it seems that mmap are safe from this point of view;

>>
>> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer this to the checksum mismatch bug.
> This is only reasonable if you are writing to the files.  Checksums appear to be checked on O_DIRECT reads, and outside of databases and VM's, read-only access accounts for a significant percentage of O_DIRECT usage, partly because it is needed for AIO support (nginx for example can serve files using AIO and O_DIRECT and gets a pretty serious performance boost on heavily loaded systems by doing so).
> 

So O_DIRECT should be unsupported/ignored only for the writing ? It could be a good compromise...

BR
G.Baroncelli

Patch
diff mbox

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3a49a3c..9841fae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2400,11 +2400,11 @@  void free_reloc_roots(struct list_head *list)
        while (!list_empty(list)) {
                reloc_root = list_entry(list->next, struct btrfs_root,
                                        root_list);
+               __del_reloc_root(reloc_root);
                free_extent_buffer(reloc_root->node);
                free_extent_buffer(reloc_root->commit_root);
                reloc_root->node = NULL;
                reloc_root->commit_root = NULL;
-               __del_reloc_root(reloc_root);
        }
 }