Message ID | 1461546072-11154-4-git-send-email-fdmanana@kernel.org (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Mon, Apr 25, 2016 at 02:01:12AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Before we start the actual relocation process of a block group, we do > calls to flush delalloc of all inodes and then wait for ordered extents > to complete. However we do these flush calls just to make sure we don't > race with concurrent tasks that have actually already started to run > delalloc and have allocated an extent from the block group we want to > relocate, right before we set it to readonly mode, but have not yet > created the respective ordered extents. The flush calls make us wait > for such concurrent tasks because they end up calling > filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> > __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> > btrfs_run_delalloc_work()) which ends up serializing us with those tasks > due to attempts to lock the same pages (and the delalloc flush procedure > calls the allocator and creates the ordered extents before unlocking the > pages). > > These flushing calls not only make us waste time (cpu, IO) but also reduce > the chances of writing larger extents (applications might be writing to > contiguous ranges and we flush before they finish dirtying the whole > ranges). > > So make sure we don't flush delalloc and just wait for concurrent tasks > that have already started flushing delalloc and have allocated an extent > from the block group we are about to relocate. > > This change also ends up fixing a race with direct IO writes that makes > relocation not wait for direct IO ordered extents. This race is > illustrated by the following diagram: > > CPU 1 CPU 2 > > btrfs_relocate_block_group(bg X) > > starts direct IO write, > target inode currently has no > ordered extents ongoing nor > dirty pages (delalloc regions), > therefore the root for our inode > is not in the list > fs_info->ordered_roots > > btrfs_direct_IO() > __blockdev_direct_IO() > btrfs_get_blocks_direct() > btrfs_lock_extent_direct() > locks range in the io tree > btrfs_new_extent_direct() > btrfs_reserve_extent() > --> extent allocated > from bg X > > btrfs_inc_block_group_ro(bg X) > > btrfs_start_delalloc_roots() > __start_delalloc_inodes() > --> does nothing, no dealloc ranges > in the inode's io tree so the > inode's root is not in the list > fs_info->delalloc_roots > > btrfs_wait_ordered_roots() > --> does not find the inode's root in the list > fs_info->ordered_roots > > --> ends up not waiting for the direct IO > write started by the task at CPU 2 > > relocate_block_group() > > btrfs_add_ordered_extent_dio() > --> now a ordered extent is > created and added to the > list root->ordered_extents > and the root added to the > list fs_info->ordered_roots > --> this is too late and the > task at CPU 1 already > started the relocation > > The consequence is the same as described in the patch titled > "Btrfs: fix race in relocation that makes us miss extents". When we hit > it, the relocation process produces the following message: > > [ 7260.832836] ------------[ cut here ]------------ > [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() > [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc > [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 > [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 > [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 > [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d > [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 > [ 7260.852998] Call Trace: > [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90 > [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2 > [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] > [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c > [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] > [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] > [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs] > [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19 > [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] > [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs] > [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63 > [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44 > [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc > [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34 > [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be > [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71 > [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79 > [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b > [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- > > It is triggered because after the first stage of the relocation (rc->stage > == MOVE_DATA_EXTENTS), we commit the current transaction and then the > second time we call relocate_block_group() (rc->stage == UPDATE_DATA_PTRS), > we have flushed and waited for delalloc on the relocation inode, but since > we didn't find and move the extent in the first stage, the block group > still has a non zero number of used bytes and therefore it triggers a > warning at the end of btrfs_relocate_block_group(). > > Later on when trying to read the extent contents from disk we hit a > BUG_ON() due to the inability to map a block with a logical address that > belongs to the block group we relocated and is no longer valid, resulting > in the following trace: > > [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 > [ 7344.887518] ------------[ cut here ]------------ > [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! > [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc > [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 > [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 > [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 > [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] > [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 > [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 > [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff > [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 > [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 > [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 > [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 > [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 > [ 7344.888431] Stack: > [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 > [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 > [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 > [ 7344.888431] Call Trace: > [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs] > [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs] > [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] > [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] > [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf > [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] > [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] > [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs] > [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10 > [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs] > [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] > [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd > [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs] > [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc > [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207 > [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207 > [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154 > [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f > [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1 > [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d > [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2 > [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e > [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b > [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0 > [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] > [ 7344.888431] RSP <ffff8802046878f0> > [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/ctree.h | 14 ++++++++++++ > fs/btrfs/extent-tree.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++-- > fs/btrfs/inode.c | 8 +++++++ > fs/btrfs/relocation.c | 6 +----- > 4 files changed, 79 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 84a6a5b..90e70e2 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1408,6 +1408,17 @@ struct btrfs_block_group_cache { > > struct btrfs_io_ctl io_ctl; > > + /* > + * Incremented when doing extent allocations and holding a read lock > + * on the space_info's groups_sem semaphore. > + * Decremented when an ordered extent that represents an IO against this > + * block group's range is created (after it's added to its inode's > + * root's list of ordered extents) or immediately after the allocation > + * if it's a metadata extent or fallocate extent (for these cases we > + * don't create ordered extents). > + */ > + atomic_t reservations; > + > /* Lock for free space tree operations. */ > struct mutex free_space_lock; > > @@ -3499,6 +3510,9 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans, > struct btrfs_root *root); > int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, > struct btrfs_root *root); > +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, > + const u64 start); > +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); > void btrfs_put_block_group(struct btrfs_block_group_cache *cache); > int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, > struct btrfs_root *root, unsigned long count); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 0544f70..35e3ea9 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -6175,6 +6175,57 @@ int btrfs_exclude_logged_extents(struct btrfs_root *log, > return 0; > } > > +static void > +btrfs_inc_block_group_reservations(struct btrfs_block_group_cache *bg) > +{ > + atomic_inc(&bg->reservations); > +} > + > +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, > + const u64 start) > +{ > + struct btrfs_block_group_cache *bg; > + > + bg = btrfs_lookup_block_group(fs_info, start); > + BUG_ON(!bg); > + if (atomic_dec_and_test(&bg->reservations)) > + wake_up_atomic_t(&bg->reservations); > + btrfs_put_block_group(bg); > +} > + > +static int btrfs_wait_bg_reservations_atomic_t(atomic_t *a) > +{ > + schedule(); > + return 0; > +} > + > +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg) > +{ > + struct btrfs_space_info *space_info = bg->space_info; > + > + ASSERT(bg->ro); > + > + if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA)) > + return; > + > + /* > + * Our block group is read only but before we set it to read only, > + * some task might have had allocated an extent from it already, but it > + * has not yet created a respective ordered extent (and added it to a > + * root's list of ordered extents). > + * Therefore wait for any task currently allocating extents, since the > + * block group's reservations counter is incremented while a read lock > + * on the groups' semaphore is held and decremented after releasing > + * the read access on that semaphore and creating the ordered extent. > + */ > + down_write(&space_info->groups_sem); > + up_write(&space_info->groups_sem); > + > + wait_on_atomic_t(&bg->reservations, > + btrfs_wait_bg_reservations_atomic_t, > + TASK_UNINTERRUPTIBLE); > +} > + > /** > * btrfs_update_reserved_bytes - update the block_group and space info counters > * @cache: The cache we are manipulating > @@ -7434,6 +7485,7 @@ checks: > btrfs_add_free_space(block_group, offset, num_bytes); > goto loop; > } > + btrfs_inc_block_group_reservations(block_group); > > /* we are all good, lets return */ > ins->objectid = search_start; > @@ -7615,8 +7667,10 @@ again: > WARN_ON(num_bytes < root->sectorsize); > ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins, > flags, delalloc); > - > - if (ret == -ENOSPC) { > + if (!ret && !is_data) { > + btrfs_dec_block_group_reservations(root->fs_info, > + ins->objectid); > + } else if (ret == -ENOSPC) { > if (!final_tried && ins->offset) { > num_bytes = min(num_bytes >> 1, ins->offset); > num_bytes = round_down(num_bytes, root->sectorsize); > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 41a5688..0085899 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -824,6 +824,7 @@ retry: > async_extent->ram_size - 1, 0); > goto out_free_reserve; > } > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > > /* > * clear dirty, set writeback and unlock the pages. > @@ -861,6 +862,7 @@ retry: > } > return; > out_free_reserve: > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); > out_free: > extent_clear_unlock_delalloc(inode, async_extent->start, > @@ -1038,6 +1040,8 @@ static noinline int cow_file_range(struct inode *inode, > goto out_drop_extent_cache; > } > > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > + > if (disk_num_bytes < cur_alloc_size) > break; > > @@ -1066,6 +1070,7 @@ out: > out_drop_extent_cache: > btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); > out_reserve: > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); > out_unlock: > extent_clear_unlock_delalloc(inode, start, end, locked_page, > @@ -7162,6 +7167,8 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode, > return ERR_PTR(ret); > } > > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > + > em = create_pinned_em(inode, start, ins.offset, start, ins.objectid, > ins.offset, ins.offset, ins.offset, 0); > if (IS_ERR(em)) { > @@ -9942,6 +9949,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode, > btrfs_end_transaction(trans, root); > break; > } > + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); > > last_alloc = ins.offset; > ret = insert_reserved_file_extent(trans, inode, > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c > index d768364..fb4dfd0 100644 > --- a/fs/btrfs/relocation.c > +++ b/fs/btrfs/relocation.c > @@ -4253,11 +4253,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start) > btrfs_info(extent_root->fs_info, "relocating block group %llu flags %llu", > rc->block_group->key.objectid, rc->block_group->flags); > > - ret = btrfs_start_delalloc_roots(fs_info, 0, -1); > - if (ret < 0) { > - err = ret; > - goto out; > - } > + btrfs_wait_block_group_reservations(rc->block_group); Hi Filipe, What a nice race. How about using existing functions like btrfs_update_reserved_bytes() to fix it? btrfs_update_reserved_bytes(struct block_group * cache) { spin_lock(&cache->lock); ... if (cache->ro) ret = -EAGAIN; else { cache->reserved += num_bytes; ... } spin_unlock(&cache->lock); } If directIO see cache->ro, then it will return -EAGAIN, if not, cache->reserved is recorded, and we can wait on this to be zero. Because both directIO and fallocate are sychronous operations, relocate_block_group won't wait on (cache->reserved == 0) too long. What do you think? Thanks, -liubo > ret = btrfs_wait_ordered_roots(fs_info, -1, > rc->block_group->key.objectid, > rc->block_group->key.offset); > -- > 2.7.0.rc3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 10, 2016 at 12:56 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > On Mon, Apr 25, 2016 at 02:01:12AM +0100, fdmanana@kernel.org wrote: >> From: Filipe Manana <fdmanana@suse.com> >> >> Before we start the actual relocation process of a block group, we do >> calls to flush delalloc of all inodes and then wait for ordered extents >> to complete. However we do these flush calls just to make sure we don't >> race with concurrent tasks that have actually already started to run >> delalloc and have allocated an extent from the block group we want to >> relocate, right before we set it to readonly mode, but have not yet >> created the respective ordered extents. The flush calls make us wait >> for such concurrent tasks because they end up calling >> filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> >> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> >> btrfs_run_delalloc_work()) which ends up serializing us with those tasks >> due to attempts to lock the same pages (and the delalloc flush procedure >> calls the allocator and creates the ordered extents before unlocking the >> pages). >> >> These flushing calls not only make us waste time (cpu, IO) but also reduce >> the chances of writing larger extents (applications might be writing to >> contiguous ranges and we flush before they finish dirtying the whole >> ranges). >> >> So make sure we don't flush delalloc and just wait for concurrent tasks >> that have already started flushing delalloc and have allocated an extent >> from the block group we are about to relocate. >> >> This change also ends up fixing a race with direct IO writes that makes >> relocation not wait for direct IO ordered extents. This race is >> illustrated by the following diagram: >> >> CPU 1 CPU 2 >> >> btrfs_relocate_block_group(bg X) >> >> starts direct IO write, >> target inode currently has no >> ordered extents ongoing nor >> dirty pages (delalloc regions), >> therefore the root for our inode >> is not in the list >> fs_info->ordered_roots >> >> btrfs_direct_IO() >> __blockdev_direct_IO() >> btrfs_get_blocks_direct() >> btrfs_lock_extent_direct() >> locks range in the io tree >> btrfs_new_extent_direct() >> btrfs_reserve_extent() >> --> extent allocated >> from bg X >> >> btrfs_inc_block_group_ro(bg X) >> >> btrfs_start_delalloc_roots() >> __start_delalloc_inodes() >> --> does nothing, no dealloc ranges >> in the inode's io tree so the >> inode's root is not in the list >> fs_info->delalloc_roots >> >> btrfs_wait_ordered_roots() >> --> does not find the inode's root in the list >> fs_info->ordered_roots >> >> --> ends up not waiting for the direct IO >> write started by the task at CPU 2 >> >> relocate_block_group() >> >> btrfs_add_ordered_extent_dio() >> --> now a ordered extent is >> created and added to the >> list root->ordered_extents >> and the root added to the >> list fs_info->ordered_roots >> --> this is too late and the >> task at CPU 1 already >> started the relocation >> >> The consequence is the same as described in the patch titled >> "Btrfs: fix race in relocation that makes us miss extents". When we hit >> it, the relocation process produces the following message: >> >> [ 7260.832836] ------------[ cut here ]------------ >> [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() >> [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc >> [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 >> [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 >> [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 >> [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d >> [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 >> [ 7260.852998] Call Trace: >> [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90 >> [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2 >> [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] >> [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c >> [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] >> [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] >> [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs] >> [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19 >> [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] >> [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs] >> [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63 >> [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44 >> [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc >> [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34 >> [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be >> [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71 >> [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79 >> [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b >> [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- >> >> It is triggered because after the first stage of the relocation (rc->stage >> == MOVE_DATA_EXTENTS), we commit the current transaction and then the >> second time we call relocate_block_group() (rc->stage == UPDATE_DATA_PTRS), >> we have flushed and waited for delalloc on the relocation inode, but since >> we didn't find and move the extent in the first stage, the block group >> still has a non zero number of used bytes and therefore it triggers a >> warning at the end of btrfs_relocate_block_group(). >> >> Later on when trying to read the extent contents from disk we hit a >> BUG_ON() due to the inability to map a block with a logical address that >> belongs to the block group we relocated and is no longer valid, resulting >> in the following trace: >> >> [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 >> [ 7344.887518] ------------[ cut here ]------------ >> [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! >> [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc >> [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 >> [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 >> [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 >> [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] >> [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 >> [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 >> [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff >> [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 >> [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 >> [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 >> [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 >> [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 >> [ 7344.888431] Stack: >> [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 >> [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 >> [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 >> [ 7344.888431] Call Trace: >> [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs] >> [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs] >> [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] >> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >> [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf >> [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] >> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >> [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs] >> [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10 >> [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs] >> [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] >> [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd >> [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs] >> [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc >> [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207 >> [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207 >> [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154 >> [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f >> [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1 >> [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d >> [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2 >> [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e >> [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b >> [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0 >> [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] >> [ 7344.888431] RSP <ffff8802046878f0> >> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- >> >> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> --- >> fs/btrfs/ctree.h | 14 ++++++++++++ >> fs/btrfs/extent-tree.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++-- >> fs/btrfs/inode.c | 8 +++++++ >> fs/btrfs/relocation.c | 6 +----- >> 4 files changed, 79 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 84a6a5b..90e70e2 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1408,6 +1408,17 @@ struct btrfs_block_group_cache { >> >> struct btrfs_io_ctl io_ctl; >> >> + /* >> + * Incremented when doing extent allocations and holding a read lock >> + * on the space_info's groups_sem semaphore. >> + * Decremented when an ordered extent that represents an IO against this >> + * block group's range is created (after it's added to its inode's >> + * root's list of ordered extents) or immediately after the allocation >> + * if it's a metadata extent or fallocate extent (for these cases we >> + * don't create ordered extents). >> + */ >> + atomic_t reservations; >> + >> /* Lock for free space tree operations. */ >> struct mutex free_space_lock; >> >> @@ -3499,6 +3510,9 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans, >> struct btrfs_root *root); >> int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, >> struct btrfs_root *root); >> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, >> + const u64 start); >> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); >> void btrfs_put_block_group(struct btrfs_block_group_cache *cache); >> int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, >> struct btrfs_root *root, unsigned long count); >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 0544f70..35e3ea9 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -6175,6 +6175,57 @@ int btrfs_exclude_logged_extents(struct btrfs_root *log, >> return 0; >> } >> >> +static void >> +btrfs_inc_block_group_reservations(struct btrfs_block_group_cache *bg) >> +{ >> + atomic_inc(&bg->reservations); >> +} >> + >> +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, >> + const u64 start) >> +{ >> + struct btrfs_block_group_cache *bg; >> + >> + bg = btrfs_lookup_block_group(fs_info, start); >> + BUG_ON(!bg); >> + if (atomic_dec_and_test(&bg->reservations)) >> + wake_up_atomic_t(&bg->reservations); >> + btrfs_put_block_group(bg); >> +} >> + >> +static int btrfs_wait_bg_reservations_atomic_t(atomic_t *a) >> +{ >> + schedule(); >> + return 0; >> +} >> + >> +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg) >> +{ >> + struct btrfs_space_info *space_info = bg->space_info; >> + >> + ASSERT(bg->ro); >> + >> + if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA)) >> + return; >> + >> + /* >> + * Our block group is read only but before we set it to read only, >> + * some task might have had allocated an extent from it already, but it >> + * has not yet created a respective ordered extent (and added it to a >> + * root's list of ordered extents). >> + * Therefore wait for any task currently allocating extents, since the >> + * block group's reservations counter is incremented while a read lock >> + * on the groups' semaphore is held and decremented after releasing >> + * the read access on that semaphore and creating the ordered extent. >> + */ >> + down_write(&space_info->groups_sem); >> + up_write(&space_info->groups_sem); >> + >> + wait_on_atomic_t(&bg->reservations, >> + btrfs_wait_bg_reservations_atomic_t, >> + TASK_UNINTERRUPTIBLE); >> +} >> + >> /** >> * btrfs_update_reserved_bytes - update the block_group and space info counters >> * @cache: The cache we are manipulating >> @@ -7434,6 +7485,7 @@ checks: >> btrfs_add_free_space(block_group, offset, num_bytes); >> goto loop; >> } >> + btrfs_inc_block_group_reservations(block_group); >> >> /* we are all good, lets return */ >> ins->objectid = search_start; >> @@ -7615,8 +7667,10 @@ again: >> WARN_ON(num_bytes < root->sectorsize); >> ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins, >> flags, delalloc); >> - >> - if (ret == -ENOSPC) { >> + if (!ret && !is_data) { >> + btrfs_dec_block_group_reservations(root->fs_info, >> + ins->objectid); >> + } else if (ret == -ENOSPC) { >> if (!final_tried && ins->offset) { >> num_bytes = min(num_bytes >> 1, ins->offset); >> num_bytes = round_down(num_bytes, root->sectorsize); >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 41a5688..0085899 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -824,6 +824,7 @@ retry: >> async_extent->ram_size - 1, 0); >> goto out_free_reserve; >> } >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> >> /* >> * clear dirty, set writeback and unlock the pages. >> @@ -861,6 +862,7 @@ retry: >> } >> return; >> out_free_reserve: >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); >> out_free: >> extent_clear_unlock_delalloc(inode, async_extent->start, >> @@ -1038,6 +1040,8 @@ static noinline int cow_file_range(struct inode *inode, >> goto out_drop_extent_cache; >> } >> >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> + >> if (disk_num_bytes < cur_alloc_size) >> break; >> >> @@ -1066,6 +1070,7 @@ out: >> out_drop_extent_cache: >> btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); >> out_reserve: >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); >> out_unlock: >> extent_clear_unlock_delalloc(inode, start, end, locked_page, >> @@ -7162,6 +7167,8 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode, >> return ERR_PTR(ret); >> } >> >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> + >> em = create_pinned_em(inode, start, ins.offset, start, ins.objectid, >> ins.offset, ins.offset, ins.offset, 0); >> if (IS_ERR(em)) { >> @@ -9942,6 +9949,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode, >> btrfs_end_transaction(trans, root); >> break; >> } >> + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); >> >> last_alloc = ins.offset; >> ret = insert_reserved_file_extent(trans, inode, >> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c >> index d768364..fb4dfd0 100644 >> --- a/fs/btrfs/relocation.c >> +++ b/fs/btrfs/relocation.c >> @@ -4253,11 +4253,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start) >> btrfs_info(extent_root->fs_info, "relocating block group %llu flags %llu", >> rc->block_group->key.objectid, rc->block_group->flags); >> >> - ret = btrfs_start_delalloc_roots(fs_info, 0, -1); >> - if (ret < 0) { >> - err = ret; >> - goto out; >> - } >> + btrfs_wait_block_group_reservations(rc->block_group); > > Hi Filipe, > > What a nice race. > > How about using existing functions like btrfs_update_reserved_bytes() to fix it? > > btrfs_update_reserved_bytes(struct block_group * cache) > { > spin_lock(&cache->lock); > ... > if (cache->ro) > ret = -EAGAIN; > else { > cache->reserved += num_bytes; > ... > } > spin_unlock(&cache->lock); > } > > If directIO see cache->ro, then it will return -EAGAIN, if not, > cache->reserved is recorded, and we can wait on this to be zero. > Because both directIO and fallocate are sychronous operations, > relocate_block_group won't wait on (cache->reserved == 0) too long. Hi Bo, Initially I considered using the reserved field, as it was apparently the simplest way possible. But I gave up on it, since it's only decremented when the data delayed reference is run, which can take a long time to happen (worst, and typically the most common case, is at transaction commit time). Btw, there's a v2 patchset around, which is the same as v1 except I removed patch 1 since it's not valid (prepare_to_relocate() commits the current transaction). thanks > > What do you think? > > Thanks, > > -liubo > >> ret = btrfs_wait_ordered_roots(fs_info, -1, >> rc->block_group->key.objectid, >> rc->block_group->key.offset); >> -- >> 2.7.0.rc3 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 84a6a5b..90e70e2 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1408,6 +1408,17 @@ struct btrfs_block_group_cache { struct btrfs_io_ctl io_ctl; + /* + * Incremented when doing extent allocations and holding a read lock + * on the space_info's groups_sem semaphore. + * Decremented when an ordered extent that represents an IO against this + * block group's range is created (after it's added to its inode's + * root's list of ordered extents) or immediately after the allocation + * if it's a metadata extent or fallocate extent (for these cases we + * don't create ordered extents). + */ + atomic_t reservations; + /* Lock for free space tree operations. */ struct mutex free_space_lock; @@ -3499,6 +3510,9 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_root *root); int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_root *root); +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, + const u64 start); +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); void btrfs_put_block_group(struct btrfs_block_group_cache *cache); int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_root *root, unsigned long count); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0544f70..35e3ea9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6175,6 +6175,57 @@ int btrfs_exclude_logged_extents(struct btrfs_root *log, return 0; } +static void +btrfs_inc_block_group_reservations(struct btrfs_block_group_cache *bg) +{ + atomic_inc(&bg->reservations); +} + +void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, + const u64 start) +{ + struct btrfs_block_group_cache *bg; + + bg = btrfs_lookup_block_group(fs_info, start); + BUG_ON(!bg); + if (atomic_dec_and_test(&bg->reservations)) + wake_up_atomic_t(&bg->reservations); + btrfs_put_block_group(bg); +} + +static int btrfs_wait_bg_reservations_atomic_t(atomic_t *a) +{ + schedule(); + return 0; +} + +void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg) +{ + struct btrfs_space_info *space_info = bg->space_info; + + ASSERT(bg->ro); + + if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA)) + return; + + /* + * Our block group is read only but before we set it to read only, + * some task might have had allocated an extent from it already, but it + * has not yet created a respective ordered extent (and added it to a + * root's list of ordered extents). + * Therefore wait for any task currently allocating extents, since the + * block group's reservations counter is incremented while a read lock + * on the groups' semaphore is held and decremented after releasing + * the read access on that semaphore and creating the ordered extent. + */ + down_write(&space_info->groups_sem); + up_write(&space_info->groups_sem); + + wait_on_atomic_t(&bg->reservations, + btrfs_wait_bg_reservations_atomic_t, + TASK_UNINTERRUPTIBLE); +} + /** * btrfs_update_reserved_bytes - update the block_group and space info counters * @cache: The cache we are manipulating @@ -7434,6 +7485,7 @@ checks: btrfs_add_free_space(block_group, offset, num_bytes); goto loop; } + btrfs_inc_block_group_reservations(block_group); /* we are all good, lets return */ ins->objectid = search_start; @@ -7615,8 +7667,10 @@ again: WARN_ON(num_bytes < root->sectorsize); ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins, flags, delalloc); - - if (ret == -ENOSPC) { + if (!ret && !is_data) { + btrfs_dec_block_group_reservations(root->fs_info, + ins->objectid); + } else if (ret == -ENOSPC) { if (!final_tried && ins->offset) { num_bytes = min(num_bytes >> 1, ins->offset); num_bytes = round_down(num_bytes, root->sectorsize); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 41a5688..0085899 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -824,6 +824,7 @@ retry: async_extent->ram_size - 1, 0); goto out_free_reserve; } + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); /* * clear dirty, set writeback and unlock the pages. @@ -861,6 +862,7 @@ retry: } return; out_free_reserve: + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); out_free: extent_clear_unlock_delalloc(inode, async_extent->start, @@ -1038,6 +1040,8 @@ static noinline int cow_file_range(struct inode *inode, goto out_drop_extent_cache; } + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); + if (disk_num_bytes < cur_alloc_size) break; @@ -1066,6 +1070,7 @@ out: out_drop_extent_cache: btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0); out_reserve: + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1); out_unlock: extent_clear_unlock_delalloc(inode, start, end, locked_page, @@ -7162,6 +7167,8 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode, return ERR_PTR(ret); } + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); + em = create_pinned_em(inode, start, ins.offset, start, ins.objectid, ins.offset, ins.offset, ins.offset, 0); if (IS_ERR(em)) { @@ -9942,6 +9949,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode, btrfs_end_transaction(trans, root); break; } + btrfs_dec_block_group_reservations(root->fs_info, ins.objectid); last_alloc = ins.offset; ret = insert_reserved_file_extent(trans, inode, diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index d768364..fb4dfd0 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4253,11 +4253,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start) btrfs_info(extent_root->fs_info, "relocating block group %llu flags %llu", rc->block_group->key.objectid, rc->block_group->flags); - ret = btrfs_start_delalloc_roots(fs_info, 0, -1); - if (ret < 0) { - err = ret; - goto out; - } + btrfs_wait_block_group_reservations(rc->block_group); ret = btrfs_wait_ordered_roots(fs_info, -1, rc->block_group->key.objectid, rc->block_group->key.offset);