Message ID | 20150423151704.GA25585@ret.masoncoding.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: > On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >> >> Trying the current integration-4.1 branch, I ran into the following >> >> during xfstests/btrfs/049: >> >> >> > >> > Ugh, I must not be waiting correctly in one of the inode cache writeout >> > sections. But I've run 049 a whole bunch of times without triggering, >> > can you get this to happen consistently? >> >> All the time so far. > > I'm testing with this now: > > commit 9f433238891b1b243c4f19d3f36eed913b270cbc > Author: Chris Mason <clm@fb.com> > Date: Thu Apr 23 08:02:49 2015 -0700 > > Btrfs: fix inode cache writeout > > The code to fix stalls during free spache cache IO wasn't using > the correct root when waiting on the IO for inode caches. This > is only a problem when the inode cache is enabled with > > mount -o inode_cache > > This fixes the inode cache writeout to preserve any error values and > makes sure not to override the root when inode cache writeout is done. > > Reported-by: Filipe Manana <fdmanana@suse.com> > Signed-off-by: Chris Mason <clm@fb.com> Thanks, btrfs/049 now passes with that patch applied. Running the whole xfstests suite now. > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > index 5a4f5d1..8cd797f 100644 > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root, > if (!inode) > return 0; > > - root = root->fs_info->tree_root; > + if (block_group) > + root = root->fs_info->tree_root; > > /* Flush the dirty pages in the cache file. */ > ret = flush_dirty_cache(inode); > @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root, > if (!btrfs_test_opt(root, INODE_MAP_CACHE)) > return 0; > > + memset(&io_ctl, 0, sizeof(io_ctl)); > ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl, > - trans, path, 0) || > - btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); > + trans, path, 0); > + if (!ret) > + ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); > + > if (ret) { > btrfs_delalloc_release_metadata(inode, inode->i_size); > #ifdef DEBUG
On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdmanana@gmail.com> wrote: > On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: >> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >>> >> Trying the current integration-4.1 branch, I ran into the following >>> >> during xfstests/btrfs/049: >>> >> >>> > >>> > Ugh, I must not be waiting correctly in one of the inode cache writeout >>> > sections. But I've run 049 a whole bunch of times without triggering, >>> > can you get this to happen consistently? >>> >>> All the time so far. >> >> I'm testing with this now: >> >> commit 9f433238891b1b243c4f19d3f36eed913b270cbc >> Author: Chris Mason <clm@fb.com> >> Date: Thu Apr 23 08:02:49 2015 -0700 >> >> Btrfs: fix inode cache writeout >> >> The code to fix stalls during free spache cache IO wasn't using >> the correct root when waiting on the IO for inode caches. This >> is only a problem when the inode cache is enabled with >> >> mount -o inode_cache >> >> This fixes the inode cache writeout to preserve any error values and >> makes sure not to override the root when inode cache writeout is done. >> >> Reported-by: Filipe Manana <fdmanana@suse.com> >> Signed-off-by: Chris Mason <clm@fb.com> > > Thanks, btrfs/049 now passes with that patch applied. > Running the whole xfstests suite now. btrfs/066 also failed once during final fsck with: _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent *** fsck.btrfs output *** checking extents checking free space cache There is no free space entry for 21676032-21680128 There is no free space entry for 21676032-87031808 cache appears valid but isnt 20971520 Checking filesystem on /dev/sdc UUID: f7785aa7-d5ba-479d-a211-7c31039dc9b1 found 11911316 bytes used err is -22 total csum bytes: 7656 total tree bytes: 454656 total fs tree bytes: 376832 total extent tree bytes: 36864 btree space waste bytes: 122959 file data blocks allocated: 42893312 referenced 31158272 (it failed like that 1 out of 4 runs) > >> >> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c >> index 5a4f5d1..8cd797f 100644 >> --- a/fs/btrfs/free-space-cache.c >> +++ b/fs/btrfs/free-space-cache.c >> @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root, >> if (!inode) >> return 0; >> >> - root = root->fs_info->tree_root; >> + if (block_group) >> + root = root->fs_info->tree_root; >> >> /* Flush the dirty pages in the cache file. */ >> ret = flush_dirty_cache(inode); >> @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root, >> if (!btrfs_test_opt(root, INODE_MAP_CACHE)) >> return 0; >> >> + memset(&io_ctl, 0, sizeof(io_ctl)); >> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl, >> - trans, path, 0) || >> - btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >> + trans, path, 0); >> + if (!ret) >> + ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >> + >> if (ret) { >> btrfs_delalloc_release_metadata(inode, inode->i_size); >> #ifdef DEBUG > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men."
On 04/23/2015 03:43 PM, Filipe David Manana wrote: > On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdmanana@gmail.com> wrote: >> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: >>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >>>>>> Trying the current integration-4.1 branch, I ran into the following >>>>>> during xfstests/btrfs/049: >>>>>> >>>>> >>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout >>>>> sections. But I've run 049 a whole bunch of times without triggering, >>>>> can you get this to happen consistently? >>>> >>>> All the time so far. >>> >>> I'm testing with this now: >>> >>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc >>> Author: Chris Mason <clm@fb.com> >>> Date: Thu Apr 23 08:02:49 2015 -0700 >>> >>> Btrfs: fix inode cache writeout >>> >>> The code to fix stalls during free spache cache IO wasn't using >>> the correct root when waiting on the IO for inode caches. This >>> is only a problem when the inode cache is enabled with >>> >>> mount -o inode_cache >>> >>> This fixes the inode cache writeout to preserve any error values and >>> makes sure not to override the root when inode cache writeout is done. >>> >>> Reported-by: Filipe Manana <fdmanana@suse.com> >>> Signed-off-by: Chris Mason <clm@fb.com> >> >> Thanks, btrfs/049 now passes with that patch applied. >> Running the whole xfstests suite now. > > btrfs/066 also failed once during final fsck with: > > _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent > *** fsck.btrfs output *** > checking extents > checking free space cache > There is no free space entry for 21676032-21680128 > There is no free space entry for 21676032-87031808 > cache appears valid but isnt 20971520 Josef has a btrfs-progs patch for this. The kernel will toss the cache. There's a somewhat fundamental race in cache writeout this patch makes a little bigger, but it has always been there. (compare what find_free_extent can do with no trans running vs the actual cache writeback) -chris > Checking filesystem on /dev/sdc > UUID: f7785aa7-d5ba-479d-a211-7c31039dc9b1 > found 11911316 bytes used err is -22 > total csum bytes: 7656 > total tree bytes: 454656 > total fs tree bytes: 376832 > total extent tree bytes: 36864 > btree space waste bytes: 122959 > file data blocks allocated: 42893312 > referenced 31158272 > > (it failed like that 1 out of 4 runs) > > >> >>> >>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c >>> index 5a4f5d1..8cd797f 100644 >>> --- a/fs/btrfs/free-space-cache.c >>> +++ b/fs/btrfs/free-space-cache.c >>> @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root, >>> if (!inode) >>> return 0; >>> >>> - root = root->fs_info->tree_root; >>> + if (block_group) >>> + root = root->fs_info->tree_root; >>> >>> /* Flush the dirty pages in the cache file. */ >>> ret = flush_dirty_cache(inode); >>> @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root, >>> if (!btrfs_test_opt(root, INODE_MAP_CACHE)) >>> return 0; >>> >>> + memset(&io_ctl, 0, sizeof(io_ctl)); >>> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl, >>> - trans, path, 0) || >>> - btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >>> + trans, path, 0); >>> + if (!ret) >>> + ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >>> + >>> if (ret) { >>> btrfs_delalloc_release_metadata(inode, inode->i_size); >>> #ifdef DEBUG >> >> >> >> -- >> Filipe David Manana, >> >> "Reasonable men adapt themselves to the world. >> Unreasonable men adapt the world to themselves. >> That's why all progress depends on unreasonable men." > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason <clm@fb.com> wrote: > On 04/23/2015 03:43 PM, Filipe David Manana wrote: >> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdmanana@gmail.com> wrote: >>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: >>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >>>>>>> Trying the current integration-4.1 branch, I ran into the following >>>>>>> during xfstests/btrfs/049: >>>>>>> >>>>>> >>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout >>>>>> sections. But I've run 049 a whole bunch of times without triggering, >>>>>> can you get this to happen consistently? >>>>> >>>>> All the time so far. >>>> >>>> I'm testing with this now: >>>> >>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc >>>> Author: Chris Mason <clm@fb.com> >>>> Date: Thu Apr 23 08:02:49 2015 -0700 >>>> >>>> Btrfs: fix inode cache writeout >>>> >>>> The code to fix stalls during free spache cache IO wasn't using >>>> the correct root when waiting on the IO for inode caches. This >>>> is only a problem when the inode cache is enabled with >>>> >>>> mount -o inode_cache >>>> >>>> This fixes the inode cache writeout to preserve any error values and >>>> makes sure not to override the root when inode cache writeout is done. >>>> >>>> Reported-by: Filipe Manana <fdmanana@suse.com> >>>> Signed-off-by: Chris Mason <clm@fb.com> >>> >>> Thanks, btrfs/049 now passes with that patch applied. >>> Running the whole xfstests suite now. >> >> btrfs/066 also failed once during final fsck with: >> >> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent >> *** fsck.btrfs output *** >> checking extents >> checking free space cache >> There is no free space entry for 21676032-21680128 >> There is no free space entry for 21676032-87031808 >> cache appears valid but isnt 20971520 > > Josef has a btrfs-progs patch for this. The kernel will toss the cache. > There's a somewhat fundamental race in cache writeout this patch makes > a little bigger, but it has always been there. > > (compare what find_free_extent can do with no trans running vs the > actual cache writeback) There's also one list corruption I didn't get before and happened while running fsstress (btrfs/078), apparently due to some race: [25590.799058] ------------[ cut here ]------------ [25590.800204] WARNING: CPU: 3 PID: 7280 at lib/list_debug.c:62 __list_del_entry+0x5a/0x98() [25590.802101] list_del corruption. next->prev should be ffff8801a0f74d50, but was a56b6b6b6b6b6b6b [25590.804236] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse i2c_piix4 i2c_core psmouse serio_raw evdev parport_pc parport acpi_cpufreq processor button pcspkr thermal_sys microcode ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci virtio_ring ata_piix e1000 virtio libata floppy scsi_mod [last unloaded: btrfs] [25590.818580] CPU: 3 PID: 7280 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #1 [25590.820597] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25590.823458] 0000000000000009 ffff8803f031bc08 ffffffff8142fa46 ffffffff8108b6a2 [25590.825081] ffff8803f031bc58 ffff8803f031bc48 ffffffff81045ea5 0000000000000011 [25590.826568] ffffffff81245af7 ffff8801a0f74d50 ffff8801a0f74460 ffff880041710df0 [25590.828106] Call Trace: [25590.828630] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [25590.829706] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [25590.830785] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [25590.831957] [<ffffffff81245af7>] ? __list_del_entry+0x5a/0x98 [25590.867473] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [25590.868631] [<ffffffffa04907e2>] ? btrfs_csum_data+0x16/0x18 [btrfs] [25590.869524] [<ffffffff81245af7>] __list_del_entry+0x5a/0x98 [25590.870918] [<ffffffffa04cff5f>] write_bitmap_entries+0x99/0xbd [btrfs] [25590.872377] [<ffffffffa04d0858>] ? __btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs] [25590.874079] [<ffffffffa04d0864>] __btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs] [25590.875594] [<ffffffffa04d1269>] ? btrfs_write_out_cache+0x41/0xdc [btrfs] [25590.877032] [<ffffffffa04d12bb>] btrfs_write_out_cache+0x93/0xdc [btrfs] [25590.878406] [<ffffffffa04888e7>] ? btrfs_start_dirty_block_groups+0x156/0x29b [btrfs] [25590.879859] [<ffffffffa0488977>] btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs] [25590.881360] [<ffffffffa04970f2>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [25590.882504] [<ffffffffa0471748>] btrfs_sync_fs+0xe1/0x12d [btrfs] [25590.883600] [<ffffffff811569d3>] ? iterate_supers+0x60/0xc4 [25590.884671] [<ffffffff8117acda>] ? vfs_fsync+0x1e/0x1e [25590.885640] [<ffffffff8117acfa>] sync_fs_one_sb+0x20/0x22 [25590.886619] [<ffffffff811569e9>] iterate_supers+0x76/0xc4 [25590.887586] [<ffffffff8117af58>] sys_sync+0x55/0x83 [25590.888534] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [25590.889694] ---[ end trace d1235b0201a01949 ]--- [25590.890573] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [25590.891870] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse i2c_piix4 i2c_core psmouse serio_raw evdev parport_pc parport acpi_cpufreq processor button pcspkr thermal_sys microcode ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci virtio_ring ata_piix e1000 virtio libata floppy scsi_mod [last unloaded: btrfs] [25590.892522] CPU: 3 PID: 7280 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #1 [25590.892522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25590.892522] task: ffff88040f485bd0 ti: ffff8803f0318000 task.ti: ffff8803f0318000 [25590.892522] RIP: 0010:[<ffffffffa04cff2c>] [<ffffffffa04cff2c>] write_bitmap_entries+0x66/0xbd [btrfs] [25590.892522] RSP: 0018:ffff8803f031bcc8 EFLAGS: 00010246 [25590.892522] RAX: 0000000000000011 RBX: ffff8801a0f74460 RCX: 0000000000000400 [25590.892522] RDX: ffff8804062e8000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff8804062e8000 [25590.892522] RBP: ffff8803f031bcf8 R08: 0000000000000001 R09: 0000000000000000 [25590.892522] R10: 0000000000000000 R11: ffffffff8165a180 R12: 6b6b6b6b6b6b6b6b [25590.892522] R13: ffff880041710df0 R14: 6b6b6b6b6b6b6b6b R15: ffff8803f031bd38 [25590.892522] FS: 00007f6a2fd9f700(0000) GS:ffff88043dd80000(0000) knlGS:0000000000000000 [25590.892522] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [25590.892522] CR2: 00007f6a280d9ad8 CR3: 00000003c65c5000 CR4: 00000000000006e0 [25590.892522] Stack: [25590.892522] ffffffffa04d0858 ffff880041710df0 ffff880041710c00 ffff88033ba59c90 [25590.892522] 0000000000000000 ffff8803efb6ff70 ffff8803f031bd78 ffffffffa04d0864 [25590.892522] ffff88037cae7f40 ffff88033ba599d0 ffff880226cd4000 ffff8803efb6ff00 [25590.892522] Call Trace: [25590.892522] [<ffffffffa04d0858>] ? __btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs] [25590.892522] [<ffffffffa04d0864>] __btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs] [25590.892522] [<ffffffffa04d1269>] ? btrfs_write_out_cache+0x41/0xdc [btrfs] [25590.892522] [<ffffffffa04d12bb>] btrfs_write_out_cache+0x93/0xdc [btrfs] [25590.892522] [<ffffffffa04888e7>] ? btrfs_start_dirty_block_groups+0x156/0x29b [btrfs] [25590.892522] [<ffffffffa0488977>] btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs] [25590.892522] [<ffffffffa04970f2>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [25590.892522] [<ffffffffa0471748>] btrfs_sync_fs+0xe1/0x12d [btrfs] [25590.892522] [<ffffffff811569d3>] ? iterate_supers+0x60/0xc4 [25590.892522] [<ffffffff8117acda>] ? vfs_fsync+0x1e/0x1e [25590.892522] [<ffffffff8117acfa>] sync_fs_one_sb+0x20/0x22 [25590.892522] [<ffffffff811569e9>] iterate_supers+0x76/0xc4 [25590.892522] [<ffffffff8117af58>] sys_sync+0x55/0x83 [25590.892522] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [25590.892522] Code: 4c 89 ef 8d 70 ff e8 a4 ec ff ff 41 8b 45 3c 41 39 45 38 7d 5c 31 f6 4c 89 ef e8 a9 f5 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 38 8d 70 ff e8 73 ec ff ff 41 8b 45 3c [25590.892522] RIP [<ffffffffa04cff2c>] write_bitmap_entries+0x66/0xbd [btrfs] [25590.892522] RSP <ffff8803f031bcc8> [25590.951443] ---[ end trace d1235b0201a0194a ]--- [25590.958107] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41 [25590.959805] in_atomic(): 1, irqs_disabled(): 0, pid: 7280, name: fsstress [25590.961219] INFO: lockdep is turned off. [25590.962095] Preemption disabled at:[<ffffffffa04d0858>] __btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs] [25590.964190] [25590.964675] CPU: 3 PID: 7280 Comm: fsstress Tainted: G D W 4.0.0-rc5-btrfs-next-9+ #1 [25590.966441] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25590.969050] 0000000000001c70 ffff8803f031ba28 ffffffff8142fa46 ffffffff8108b6a2 [25590.971014] 0000000000000000 ffff8803f031ba58 ffffffff8106674b 00007ffffffff000 [25590.973141] ffffffff817d1e02 0000000000000029 0000000000000000 ffff8803f031ba88 [25590.975044] Call Trace: [25590.975635] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [25590.976733] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [25590.977956] [<ffffffff8106674b>] ___might_sleep+0x1b9/0x1c1 [25590.979252] [<ffffffff810667f3>] __might_sleep+0xa0/0xa8 [25590.989557] [<ffffffff81433ce9>] down_read+0x21/0x50 [25590.990930] [<ffffffff810528fb>] exit_signals+0x26/0x11a [25590.992152] [<ffffffff810603d0>] ? blocking_notifier_call_chain+0x14/0x16 [25591.002069] [<ffffffff8104727e>] do_exit+0x128/0x9cb [25591.002719] [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc [25591.003542] [<ffffffff8108c36d>] ? kmsg_dump+0xd2/0xf8 [25591.004677] [<ffffffff8108c38a>] ? kmsg_dump+0xef/0xf8 [25591.005763] [<ffffffff8100594c>] oops_end+0xa6/0xae [25591.006797] [<ffffffff81005dd1>] die+0x5a/0x63 [25591.007760] [<ffffffff81002dc0>] do_general_protection+0x97/0x142 [25591.009032] [<ffffffff81437812>] general_protection+0x22/0x30 [25591.010245] [<ffffffffa04cff2c>] ? write_bitmap_entries+0x66/0xbd [btrfs] [25591.012880] [<ffffffffa04cff5f>] ? write_bitmap_entries+0x99/0xbd [btrfs] [25591.013686] [<ffffffffa04d0858>] ? __btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs] [25591.014709] [<ffffffffa04d0864>] __btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs] [25591.016398] [<ffffffffa04d1269>] ? btrfs_write_out_cache+0x41/0xdc [btrfs] [25591.017207] [<ffffffffa04d12bb>] btrfs_write_out_cache+0x93/0xdc [btrfs] [25591.018002] [<ffffffffa04888e7>] ? btrfs_start_dirty_block_groups+0x156/0x29b [btrfs] [25591.019000] [<ffffffffa0488977>] btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs] [25591.019980] [<ffffffffa04970f2>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [25591.020944] [<ffffffffa0471748>] btrfs_sync_fs+0xe1/0x12d [btrfs] [25591.021679] [<ffffffff811569d3>] ? iterate_supers+0x60/0xc4 [25591.022693] [<ffffffff8117acda>] ? vfs_fsync+0x1e/0x1e [25591.023767] [<ffffffff8117acfa>] sync_fs_one_sb+0x20/0x22 [25591.024931] [<ffffffff811569e9>] iterate_supers+0x76/0xc4 [25591.026050] [<ffffffff8117af58>] sys_sync+0x55/0x83 [25591.027084] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 (....) > > -chris > >> Checking filesystem on /dev/sdc >> UUID: f7785aa7-d5ba-479d-a211-7c31039dc9b1 >> found 11911316 bytes used err is -22 >> total csum bytes: 7656 >> total tree bytes: 454656 >> total fs tree bytes: 376832 >> total extent tree bytes: 36864 >> btree space waste bytes: 122959 >> file data blocks allocated: 42893312 >> referenced 31158272 >> >> (it failed like that 1 out of 4 runs) >> >> >>> >>>> >>>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c >>>> index 5a4f5d1..8cd797f 100644 >>>> --- a/fs/btrfs/free-space-cache.c >>>> +++ b/fs/btrfs/free-space-cache.c >>>> @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root, >>>> if (!inode) >>>> return 0; >>>> >>>> - root = root->fs_info->tree_root; >>>> + if (block_group) >>>> + root = root->fs_info->tree_root; >>>> >>>> /* Flush the dirty pages in the cache file. */ >>>> ret = flush_dirty_cache(inode); >>>> @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root, >>>> if (!btrfs_test_opt(root, INODE_MAP_CACHE)) >>>> return 0; >>>> >>>> + memset(&io_ctl, 0, sizeof(io_ctl)); >>>> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl, >>>> - trans, path, 0) || >>>> - btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >>>> + trans, path, 0); >>>> + if (!ret) >>>> + ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); >>>> + >>>> if (ret) { >>>> btrfs_delalloc_release_metadata(inode, inode->i_size); >>>> #ifdef DEBUG >>> >>> >>> >>> -- >>> Filipe David Manana, >>> >>> "Reasonable men adapt themselves to the world. >>> Unreasonable men adapt the world to themselves. >>> That's why all progress depends on unreasonable men." >> >> >> >
On 04/24/2015 02:34 AM, Filipe David Manana wrote: > On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason <clm@fb.com> wrote: >> On 04/23/2015 03:43 PM, Filipe David Manana wrote: >>> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdmanana@gmail.com> wrote: >>>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: >>>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >>>>>>>> Trying the current integration-4.1 branch, I ran into the following >>>>>>>> during xfstests/btrfs/049: >>>>>>>> >>>>>>> >>>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout >>>>>>> sections. But I've run 049 a whole bunch of times without triggering, >>>>>>> can you get this to happen consistently? >>>>>> >>>>>> All the time so far. >>>>> >>>>> I'm testing with this now: >>>>> >>>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc >>>>> Author: Chris Mason <clm@fb.com> >>>>> Date: Thu Apr 23 08:02:49 2015 -0700 >>>>> >>>>> Btrfs: fix inode cache writeout >>>>> >>>>> The code to fix stalls during free spache cache IO wasn't using >>>>> the correct root when waiting on the IO for inode caches. This >>>>> is only a problem when the inode cache is enabled with >>>>> >>>>> mount -o inode_cache >>>>> >>>>> This fixes the inode cache writeout to preserve any error values and >>>>> makes sure not to override the root when inode cache writeout is done. >>>>> >>>>> Reported-by: Filipe Manana <fdmanana@suse.com> >>>>> Signed-off-by: Chris Mason <clm@fb.com> >>>> >>>> Thanks, btrfs/049 now passes with that patch applied. >>>> Running the whole xfstests suite now. >>> >>> btrfs/066 also failed once during final fsck with: >>> >>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent >>> *** fsck.btrfs output *** >>> checking extents >>> checking free space cache >>> There is no free space entry for 21676032-21680128 >>> There is no free space entry for 21676032-87031808 >>> cache appears valid but isnt 20971520 >> >> Josef has a btrfs-progs patch for this. The kernel will toss the cache. >> There's a somewhat fundamental race in cache writeout this patch makes >> a little bigger, but it has always been there. >> >> (compare what find_free_extent can do with no trans running vs the >> actual cache writeback) > > There's also one list corruption I didn't get before and happened > while running fsstress (btrfs/078), apparently due to some race: Can you please bang on this and get a more reliable reproduction? I'll take a look. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 24, 2015 at 2:00 PM, Chris Mason <clm@fb.com> wrote: > On 04/24/2015 02:34 AM, Filipe David Manana wrote: >> On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason <clm@fb.com> wrote: >>> On 04/23/2015 03:43 PM, Filipe David Manana wrote: >>>> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdmanana@gmail.com> wrote: >>>>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <clm@fb.com> wrote: >>>>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote: >>>>>>>>> Trying the current integration-4.1 branch, I ran into the following >>>>>>>>> during xfstests/btrfs/049: >>>>>>>>> >>>>>>>> >>>>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout >>>>>>>> sections. But I've run 049 a whole bunch of times without triggering, >>>>>>>> can you get this to happen consistently? >>>>>>> >>>>>>> All the time so far. >>>>>> >>>>>> I'm testing with this now: >>>>>> >>>>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc >>>>>> Author: Chris Mason <clm@fb.com> >>>>>> Date: Thu Apr 23 08:02:49 2015 -0700 >>>>>> >>>>>> Btrfs: fix inode cache writeout >>>>>> >>>>>> The code to fix stalls during free spache cache IO wasn't using >>>>>> the correct root when waiting on the IO for inode caches. This >>>>>> is only a problem when the inode cache is enabled with >>>>>> >>>>>> mount -o inode_cache >>>>>> >>>>>> This fixes the inode cache writeout to preserve any error values and >>>>>> makes sure not to override the root when inode cache writeout is done. >>>>>> >>>>>> Reported-by: Filipe Manana <fdmanana@suse.com> >>>>>> Signed-off-by: Chris Mason <clm@fb.com> >>>>> >>>>> Thanks, btrfs/049 now passes with that patch applied. >>>>> Running the whole xfstests suite now. >>>> >>>> btrfs/066 also failed once during final fsck with: >>>> >>>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent >>>> *** fsck.btrfs output *** >>>> checking extents >>>> checking free space cache >>>> There is no free space entry for 21676032-21680128 >>>> There is no free space entry for 21676032-87031808 >>>> cache appears valid but isnt 20971520 >>> >>> Josef has a btrfs-progs patch for this. The kernel will toss the cache. >>> There's a somewhat fundamental race in cache writeout this patch makes >>> a little bigger, but it has always been there. >>> >>> (compare what find_free_extent can do with no trans running vs the >>> actual cache writeback) >> >> There's also one list corruption I didn't get before and happened >> while running fsstress (btrfs/078), apparently due to some race: > > Can you please bang on this and get a more reliable reproduction? I'll > take a look. Not really that easy to get a more reliable reproducer - just run fsstress with multiple processes - it already happened twice again after I sent the previous mail. From the quick look I had at this, this seems to be the change causing the problem: http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=1bbc621ef28462456131c035eaeb5567a1a2a2fe Early in btrfs_commit_transaction(), btrfs_start_dirty_block_groups() is called which ends up calling __btrfs_write_out_cache() for each dirty block group, which collects all the bitmap entries from the bg's space cache into a local list while holding the cache's ctl->tree_lock (to serialize with concurrent allocation requests). Then we unlock ctl->tree_lock, do other stuff and later acquire ctl->tree_lock again and call write_bitmap_entries() to write the bitmap entries we previously collected. However, while we were doing the other stuff without holding that lock, allocation requests might have happened right? - since when we call btrfs_start_dirty_block_groups() in btrfs_commit_transaction() the transaction state wasn't yet changed, allowing other tasks to join the current transaction. If such other task allocates all the remaining space from a bitmap entry we collected before (because it's still in the space cache's rbtree), it ends up deleting it and freeing its ->bitmap member, which results in an invalid memory access (and the warning on the list corruption) when we later call write_bitmap_entries() in __btrfs_write_out_cache() - which is what the second part of the trace I sent says: [25590.890573] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC (...) [25590.892522] CPU: 3 PID: 7280 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #1 [25590.892522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25590.892522] task: ffff88040f485bd0 ti: ffff8803f0318000 task.ti: ffff8803f0318000 [25590.892522] RIP: 0010:[<ffffffffa04cff2c>] [<ffffffffa04cff2c>] write_bitmap_entries+0x66/0xbd [btrfs] (...) [25590.892522] Stack: [25590.892522] ffffffffa04d0858 ffff880041710df0 ffff880041710c00 ffff88033ba59c90 [25590.892522] 0000000000000000 ffff8803efb6ff70 ffff8803f031bd78 ffffffffa04d0864 [25590.892522] ffff88037cae7f40 ffff88033ba599d0 ffff880226cd4000 ffff8803efb6ff00 [25590.892522] Call Trace: [25590.892522] [<ffffffffa04d0858>] ? __btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs] [25590.892522] [<ffffffffa04d0864>] __btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs] [25590.892522] [<ffffffffa04d1269>] ? btrfs_write_out_cache+0x41/0xdc [btrfs] [25590.892522] [<ffffffffa04d12bb>] btrfs_write_out_cache+0x93/0xdc [btrfs] [25590.892522] [<ffffffffa04888e7>] ? btrfs_start_dirty_block_groups+0x156/0x29b [btrfs] [25590.892522] [<ffffffffa0488977>] btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs] [25590.892522] [<ffffffffa04970f2>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] And write_bitmap_entries+0x66 corresponds to: static int io_ctl_add_bitmap(struct btrfs_io_ctl *io_ctl, void *bitmap) { (...) memcpy(io_ctl->cur, bitmap, PAGE_CACHE_SIZE); (...) } I.e. the ->bitmap field from an entry we pass from write_bitmap_entries(). So releasing the tree_lock after collecting the bitmaps and before writing them out seems to me what leads to the race and invalid memory access. But I might be wrong, I haven't spent more than 5 minutes analyzing it. > > -chris >
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 5a4f5d1..8cd797f 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1149,7 +1149,8 @@ int btrfs_wait_cache_io(struct btrfs_root *root, if (!inode) return 0; - root = root->fs_info->tree_root; + if (block_group) + root = root->fs_info->tree_root; /* Flush the dirty pages in the cache file. */ ret = flush_dirty_cache(inode); @@ -3465,9 +3466,12 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root, if (!btrfs_test_opt(root, INODE_MAP_CACHE)) return 0; + memset(&io_ctl, 0, sizeof(io_ctl)); ret = __btrfs_write_out_cache(root, inode, ctl, NULL, &io_ctl, - trans, path, 0) || - btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); + trans, path, 0); + if (!ret) + ret = btrfs_wait_cache_io(root, trans, NULL, &io_ctl, path, 0); + if (ret) { btrfs_delalloc_release_metadata(inode, inode->i_size); #ifdef DEBUG