From patchwork Thu Jan 16 06:04:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11336133 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 71EFE92A for ; Thu, 16 Jan 2020 06:04:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4557120748 for ; Thu, 16 Jan 2020 06:04:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729747AbgAPGES (ORCPT ); Thu, 16 Jan 2020 01:04:18 -0500 Received: from mx2.suse.de ([195.135.220.15]:45152 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729718AbgAPGES (ORCPT ); Thu, 16 Jan 2020 01:04:18 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 79117B1A3; Thu, 16 Jan 2020 06:04:14 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: Josef Bacik Subject: [PATCH v6 1/5] btrfs: Introduce per-profile available space facility Date: Thu, 16 Jan 2020 14:04:00 +0800 Message-Id: <20200116060404.95200-2-wqu@suse.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200116060404.95200-1-wqu@suse.com> References: <20200116060404.95200-1-wqu@suse.com> MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org [PROBLEM] There are some locations in btrfs requiring accurate estimation on how many new bytes can be allocated on unallocated space. We have two types of estimation: - Factor based calculation Just use all unallocated space, divide by the profile factor One obvious user is can_overcommit(). - Chunk allocator like calculation This will emulate the chunk allocator behavior, to get a proper estimation. The only user is btrfs_calc_avail_data_space(), utilized by btrfs_statfs(). The problem is, that function is not generic purposed enough, can't handle things like RAID5/6. Current factor based calculation can't handle the following case: devid 1 unallocated: 1T devid 2 unallocated: 10T metadata type: RAID1 If using factor, we can use (1T + 10T) / 2 = 5.5T free space for metadata. But in fact we can only get 1T free space, as we're limited by the smallest device for RAID1. [SOLUTION] This patch will introduce per-profile available space calculation, which can give an estimation based on chunk-allocator-like behavior. The difference between it and chunk allocator is mostly on rounding and [0, 1M) reserved space handling, which shouldn't cause practical impact. The newly introduced per-profile available space calculation will calculate available space for each type, using chunk-allocator like calculation. With that facility, for above device layout we get the full available space array: RAID10: 0 (not enough devices) RAID1: 1T RAID1C3: 0 (not enough devices) RAID1C4: 0 (not enough devices) DUP: 5.5T RAID0: 2T SINGLE: 11T RAID5: 1T RAID6: 0 (not enough devices) Or for a more complex example: devid 1 unallocated: 1T devid 2 unallocated: 1T devid 3 unallocated: 10T We will get an array of: RAID10: 0 (not enough devices) RAID1: 2T RAID1C3: 1T RAID1C4: 0 (not enough devices) DUP: 6T RAID0: 3T SINGLE: 12T RAID5: 2T RAID6: 0 (not enough devices) And for the each profile , we go chunk allocator level calculation: The pseudo code looks like: clear_virtual_used_space_of_all_rw_devices(); do { /* * The same as chunk allocator, despite used space, * we also take virtual used space into consideration. */ sort_device_with_virtual_free_space(); /* * Unlike chunk allocator, we don't need to bother hole/stripe * size, so we use the smallest device to make sure we can * allocated as many stripes as regular chunk allocator */ stripe_size = device_with_smallest_free->avail_space; stripe_size = min(stripe_size, to_alloc / ndevs); /* * Allocate a virtual chunk, allocated virtual chunk will * increase virtual used space, allow next iteration to * properly emulate chunk allocator behavior. */ ret = alloc_virtual_chunk(stripe_size, &allocated_size); if (ret == 0) avail += allocated_size; } while (ret == 0) As we always select the device with least free space, the device with the most space will be the first to be utilized, just like chunk allocator. For above 1T + 10T device, we will allocate a 1T virtual chunk in the first iteration, then run out of device in next iteration. Thus only get 1T free space for RAID1 type, just like what chunk allocator would do. The patch will update such per-profile available space at the following timing: - Mount time - Chunk allocation - Chunk removal - Device grow - Device shrink Those timing are all protected by chunk_mutex, and what we do are only iterating in-memory only structures, no extra IO triggered, so the performance impact should be pretty small. For the extra error handling, the principle is to keep the old behavior. That's to say, if old error handler would just return an error, then we follow it, no matter if the caller reverts the device size. For the proper error handling, they will be added in later patches. As I don't want to make the core facility bloated by the error handling code, especially some situation needs quite some new code to handle errors. Suggested-by: Josef Bacik Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik --- fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++----- fs/btrfs/volumes.h | 11 +++ 2 files changed, 207 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a6d3f08bfff3..b899c8b0d0ee 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -349,6 +349,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid, INIT_LIST_HEAD(&fs_devs->devices); INIT_LIST_HEAD(&fs_devs->alloc_list); INIT_LIST_HEAD(&fs_devs->fs_list); + spin_lock_init(&fs_devs->per_profile_lock); if (fsid) memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE); @@ -2628,6 +2629,177 @@ static noinline int btrfs_update_device(struct btrfs_trans_handle *trans, return ret; } +/* + * sort the devices in descending order by max_avail, total_avail + */ +static int btrfs_cmp_device_info(const void *a, const void *b) +{ + const struct btrfs_device_info *di_a = a; + const struct btrfs_device_info *di_b = b; + + if (di_a->max_avail > di_b->max_avail) + return -1; + if (di_a->max_avail < di_b->max_avail) + return 1; + if (di_a->total_avail > di_b->total_avail) + return -1; + if (di_a->total_avail < di_b->total_avail) + return 1; + return 0; +} + +/* + * Return 0 if we allocated any virtual(*) chunk, and restore the size to + * @allocated_size + * Return -ENOSPC if we have no more space to allocate virtual chunk + * + * *: virtual chunk is a space holder for per-profile available space + * calculator. + * Such virtual chunks won't take on-disk space, thus called virtual, and + * only affects per-profile available space calulation. + */ +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info, + struct btrfs_device_info *devices_info, + enum btrfs_raid_types type, + u64 *allocated) +{ + const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type]; + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; + struct btrfs_device *device; + u64 stripe_size; + int i; + int ndevs = 0; + + lockdep_assert_held(&fs_info->chunk_mutex); + + /* Go through devices to collect their unallocated space */ + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) { + u64 avail; + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + &device->dev_state) || + test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) + continue; + + if (device->total_bytes > device->bytes_used + + device->virtual_allocated) + avail = device->total_bytes - device->bytes_used - + device->virtual_allocated; + else + avail = 0; + + /* And exclude the [0, 1M) reserved space */ + if (avail > SZ_1M) + avail -= SZ_1M; + else + avail = 0; + + if (avail < fs_info->sectorsize) + continue; + /* + * Unlike chunk allocator, we don't care about stripe or hole + * size, so here we use @avail directly + */ + devices_info[ndevs].dev_offset = 0; + devices_info[ndevs].total_avail = avail; + devices_info[ndevs].max_avail = avail; + devices_info[ndevs].dev = device; + ++ndevs; + } + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), + btrfs_cmp_device_info, NULL); + ndevs -= ndevs % raid_attr->devs_increment; + if (ndevs < raid_attr->devs_min) + return -ENOSPC; + if (raid_attr->devs_max) + ndevs = min(ndevs, (int)raid_attr->devs_max); + else + ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info)); + + /* + * Now allocate a virtual chunk using the unallocate space of the + * device with the least unallocated space. + */ + stripe_size = round_down(devices_info[ndevs - 1].total_avail, + fs_info->sectorsize); + if (stripe_size == 0) + return -ENOSPC; + + for (i = 0; i < ndevs; i++) + devices_info[i].dev->virtual_allocated += stripe_size; + *allocated = stripe_size * (ndevs - raid_attr->nparity) / + raid_attr->ncopies; + return 0; +} + +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info, + enum btrfs_raid_types type, + u64 *result_ret) +{ + struct btrfs_device_info *devices_info = NULL; + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; + struct btrfs_device *device; + u64 allocated; + u64 result = 0; + int ret = 0; + + ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES); + + /* Not enough devices, quick exit, just update the result */ + if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min) + goto out; + + devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info), + GFP_NOFS); + if (!devices_info) { + ret = -ENOMEM; + goto out; + } + /* Clear virtual chunk used space for each device */ + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) + device->virtual_allocated = 0; + while (ret == 0) { + ret = alloc_virtual_chunk(fs_info, devices_info, type, + &allocated); + if (ret == 0) + result += allocated; + } + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) + device->virtual_allocated = 0; +out: + kfree(devices_info); + if (ret < 0 && ret != -ENOSPC) + return ret; + *result_ret = result; + return 0; +} + +/* + * Calculate the per-profile available space array. + * + * Return 0 if we succeeded updating the array. + * Return <0 if something went wrong (ENOMEM), and the array is not + * updated. + */ +static int calc_per_profile_avail(struct btrfs_fs_info *fs_info) +{ + u64 results[BTRFS_NR_RAID_TYPES]; + int i; + int ret; + + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { + ret = calc_one_profile_avail(fs_info, i, &results[i]); + if (ret < 0) + return ret; + } + + spin_lock(&fs_info->fs_devices->per_profile_lock); + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) + fs_info->fs_devices->per_profile_avail[i] = + results[i]; + spin_unlock(&fs_info->fs_devices->per_profile_lock); + return ret; +} + int btrfs_grow_device(struct btrfs_trans_handle *trans, struct btrfs_device *device, u64 new_size) { @@ -2635,6 +2807,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans, struct btrfs_super_block *super_copy = fs_info->super_copy; u64 old_total; u64 diff; + int ret; if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) return -EACCES; @@ -2661,7 +2834,10 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans, if (list_empty(&device->post_commit_list)) list_add_tail(&device->post_commit_list, &trans->transaction->dev_update_list); + ret = calc_per_profile_avail(fs_info); mutex_unlock(&fs_info->chunk_mutex); + if (ret < 0) + return ret; return btrfs_update_device(trans, device); } @@ -2831,7 +3007,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset) device->bytes_used - dev_extent_len); atomic64_add(dev_extent_len, &fs_info->free_chunk_space); btrfs_clear_space_info_full(fs_info); + ret = calc_per_profile_avail(fs_info); mutex_unlock(&fs_info->chunk_mutex); + if (ret < 0) { + mutex_unlock(&fs_devices->device_list_mutex); + btrfs_abort_transaction(trans, ret); + goto out; + } } ret = btrfs_update_device(trans, device); @@ -4526,6 +4708,11 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) atomic64_sub(diff, &fs_info->free_chunk_space); } + ret = calc_per_profile_avail(fs_info); + if (ret < 0) { + mutex_unlock(&fs_info->chunk_mutex); + goto done; + } /* * Once the device's size has been set to the new size, ensure all * in-memory chunks are synced to disk so that the loop below sees them @@ -4690,25 +4877,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, return 0; } -/* - * sort the devices in descending order by max_avail, total_avail - */ -static int btrfs_cmp_device_info(const void *a, const void *b) -{ - const struct btrfs_device_info *di_a = a; - const struct btrfs_device_info *di_b = b; - - if (di_a->max_avail > di_b->max_avail) - return -1; - if (di_a->max_avail < di_b->max_avail) - return 1; - if (di_a->total_avail > di_b->total_avail) - return -1; - if (di_a->total_avail < di_b->total_avail) - return 1; - return 0; -} - static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type) { if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK)) @@ -4986,6 +5154,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, list_add_tail(&dev->post_commit_list, &trans->transaction->dev_update_list); } + ret = calc_per_profile_avail(info); atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space); @@ -4994,7 +5163,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, check_raid1c34_incompat_flag(info, type); kfree(devices_info); - return 0; + return ret; error_del_extent: write_lock(&em_tree->lock); @@ -7629,6 +7798,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info) /* Ensure all chunks have corresponding dev extents */ ret = verify_chunk_dev_extent_mapping(fs_info); + if (ret < 0) + goto out; + + /* All dev extents are verified, update per-profile available space */ + mutex_lock(&fs_info->chunk_mutex); + ret = calc_per_profile_avail(fs_info); + mutex_unlock(&fs_info->chunk_mutex); out: btrfs_free_path(path); return ret; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index fc1b564b9cfe..5cddfe7cfee8 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -138,6 +138,13 @@ struct btrfs_device { atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX]; struct extent_io_tree alloc_state; + + /* + * the "virtual" allocated space by virtual chunk allocator, which + * is used to do accurate estimation on available space. + * Doesn't affect real chunk allocator. + */ + u64 virtual_allocated; }; /* @@ -257,6 +264,10 @@ struct btrfs_fs_devices { struct kobject fsid_kobj; struct kobject *device_dir_kobj; struct completion kobj_unregister; + + /* Records per-type available space */ + spinlock_t per_profile_lock; + u64 per_profile_avail[BTRFS_NR_RAID_TYPES]; }; #define BTRFS_BIO_INLINE_CSUM_SIZE 64 From patchwork Thu Jan 16 06:04:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11336135 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7F0D6109A for ; Thu, 16 Jan 2020 06:04:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5DA33207E0 for ; Thu, 16 Jan 2020 06:04:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729758AbgAPGEU (ORCPT ); Thu, 16 Jan 2020 01:04:20 -0500 Received: from mx2.suse.de ([195.135.220.15]:45164 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729718AbgAPGEU (ORCPT ); Thu, 16 Jan 2020 01:04:20 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id C20E7B1A2; Thu, 16 Jan 2020 06:04:18 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Cc: Marc Lehmann , Josef Bacik Subject: [PATCH v6 2/5] btrfs: space-info: Use per-profile available space in can_overcommit() Date: Thu, 16 Jan 2020 14:04:01 +0800 Message-Id: <20200116060404.95200-3-wqu@suse.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200116060404.95200-1-wqu@suse.com> References: <20200116060404.95200-1-wqu@suse.com> MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org For the following disk layout, can_overcommit() can cause false confidence in available space: devid 1 unallocated: 1T devid 2 unallocated: 10T metadata type: RAID1 As can_overcommit() simply uses unallocated space with factor to calculate the allocatable metadata chunk size. can_overcommit() believes we still have 5.5T for metadata chunks, while the truth is, we only have 1T available for metadata chunks. This can lead to ENOSPC at run_delalloc_range() and cause transaction abort. Since factor based calculation can't distinguish RAID1/RAID10 and DUP at all, we need proper chunk-allocator level awareness to do such estimation. Thankfully, we have per-profile available space already calculated, just use that facility to avoid such false confidence. Reported-by: Marc Lehmann Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik --- fs/btrfs/space-info.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index f09aa6ee9113..c26aba9e7124 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -164,10 +164,10 @@ static int can_overcommit(struct btrfs_fs_info *fs_info, enum btrfs_reserve_flush_enum flush, bool system_chunk) { + enum btrfs_raid_types index; u64 profile; u64 avail; u64 used; - int factor; /* Don't overcommit when in mixed mode. */ if (space_info->flags & BTRFS_BLOCK_GROUP_DATA) @@ -179,16 +179,15 @@ static int can_overcommit(struct btrfs_fs_info *fs_info, profile = btrfs_metadata_alloc_profile(fs_info); used = btrfs_space_info_used(space_info, true); - avail = atomic64_read(&fs_info->free_chunk_space); /* - * If we have dup, raid1 or raid10 then only half of the free - * space is actually usable. For raid56, the space info used - * doesn't include the parity drive, so we don't have to - * change the math + * Grab avail space from per-profile array which should be as accurate + * as chunk allocator. */ - factor = btrfs_bg_type_to_factor(profile); - avail = div_u64(avail, factor); + index = btrfs_bg_flags_to_raid_index(profile); + spin_lock(&fs_info->fs_devices->per_profile_lock); + avail = fs_info->fs_devices->per_profile_avail[index]; + spin_unlock(&fs_info->fs_devices->per_profile_lock); /* * If we aren't flushing all things, let us overcommit up to From patchwork Thu Jan 16 06:04:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11336137 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1E013109A for ; Thu, 16 Jan 2020 06:04:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id F02C520748 for ; Thu, 16 Jan 2020 06:04:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729774AbgAPGEY (ORCPT ); Thu, 16 Jan 2020 01:04:24 -0500 Received: from mx2.suse.de ([195.135.220.15]:45188 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729718AbgAPGEY (ORCPT ); Thu, 16 Jan 2020 01:04:24 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 0C70BB1A2 for ; Thu, 16 Jan 2020 06:04:22 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH v6 3/5] btrfs: statfs: Use pre-calculated per-profile available space Date: Thu, 16 Jan 2020 14:04:02 +0800 Message-Id: <20200116060404.95200-4-wqu@suse.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200116060404.95200-1-wqu@suse.com> References: <20200116060404.95200-1-wqu@suse.com> MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Although btrfs_calc_avail_data_space() is trying to do an estimation on how many data chunks it can allocate, the estimation is far from perfect: - Metadata over-commit is not considered at all - Chunk allocation doesn't take RAID5/6 into consideration This patch will change btrfs_calc_avail_data_space() to use pre-calculated per-profile available space. This provides the following benefits: - Accurate unallocated data space estimation, including RAID5/6 It's as accurate as chunk allocator, and can handle RAID5/6. Although it still can't handle metadata over-commit that accurately, we still have fallback method for over-commit, by using factor based estimation. The good news is, over-commit can only happen when we have enough unallocated space, so even we may not report byte accurate result when the fs is empty, the result will get more and more accurate when unallocated space is reducing. So the metadata over-commit shouldn't cause too many problem. Since we're keeping the old lock-free design, statfs should not experience any extra delay. Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik --- fs/btrfs/super.c | 182 +++++++++++++---------------------------------- 1 file changed, 48 insertions(+), 134 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f452a94abdc3..b5a32339e544 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1894,118 +1894,53 @@ static inline void btrfs_descending_sort_devices( * The helper to calc the free space on the devices that can be used to store * file data. */ -static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, - u64 *free_bytes) +static u64 btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, + u64 free_meta) { - struct btrfs_device_info *devices_info; struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; - struct btrfs_device *device; - u64 type; - u64 avail_space; - u64 min_stripe_size; - int num_stripes = 1; - int i = 0, nr_devices; - const struct btrfs_raid_attr *rattr; + enum btrfs_raid_types data_type; + u64 data_profile = btrfs_data_alloc_profile(fs_info); + u64 data_avail; + u64 meta_rsv; - /* - * We aren't under the device list lock, so this is racy-ish, but good - * enough for our purposes. - */ - nr_devices = fs_info->fs_devices->open_devices; - if (!nr_devices) { - smp_mb(); - nr_devices = fs_info->fs_devices->open_devices; - ASSERT(nr_devices); - if (!nr_devices) { - *free_bytes = 0; - return 0; - } - } - - devices_info = kmalloc_array(nr_devices, sizeof(*devices_info), - GFP_KERNEL); - if (!devices_info) - return -ENOMEM; - - /* calc min stripe number for data space allocation */ - type = btrfs_data_alloc_profile(fs_info); - rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)]; - - if (type & BTRFS_BLOCK_GROUP_RAID0) - num_stripes = nr_devices; - else if (type & BTRFS_BLOCK_GROUP_RAID1) - num_stripes = 2; - else if (type & BTRFS_BLOCK_GROUP_RAID1C3) - num_stripes = 3; - else if (type & BTRFS_BLOCK_GROUP_RAID1C4) - num_stripes = 4; - else if (type & BTRFS_BLOCK_GROUP_RAID10) - num_stripes = 4; + spin_lock(&fs_info->global_block_rsv.lock); + meta_rsv = fs_info->global_block_rsv.size; + spin_unlock(&fs_info->global_block_rsv.lock); - /* Adjust for more than 1 stripe per device */ - min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN; + data_type = btrfs_bg_flags_to_raid_index(data_profile); - rcu_read_lock(); - list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) { - if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, - &device->dev_state) || - !device->bdev || - test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) - continue; + spin_lock(&fs_devices->per_profile_lock); + data_avail = fs_devices->per_profile_avail[data_type]; + spin_unlock(&fs_devices->per_profile_lock); - if (i >= nr_devices) - break; - - avail_space = device->total_bytes - device->bytes_used; - - /* align with stripe_len */ - avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN); - - /* - * In order to avoid overwriting the superblock on the drive, - * btrfs starts at an offset of at least 1MB when doing chunk - * allocation. - * - * This ensures we have at least min_stripe_size free space - * after excluding 1MB. - */ - if (avail_space <= SZ_1M + min_stripe_size) - continue; - - avail_space -= SZ_1M; - - devices_info[i].dev = device; - devices_info[i].max_avail = avail_space; - - i++; - } - rcu_read_unlock(); - - nr_devices = i; - - btrfs_descending_sort_devices(devices_info, nr_devices); - - i = nr_devices - 1; - avail_space = 0; - while (nr_devices >= rattr->devs_min) { - num_stripes = min(num_stripes, nr_devices); - - if (devices_info[i].max_avail >= min_stripe_size) { - int j; - u64 alloc_size; - - avail_space += devices_info[i].max_avail * num_stripes; - alloc_size = devices_info[i].max_avail; - for (j = i + 1 - num_stripes; j <= i; j++) - devices_info[j].max_avail -= alloc_size; - } - i--; - nr_devices--; + /* + * We have meta over-committed, do some wild guess using factor. + * + * To get an accurate result, we should allocate a metadata virtual + * chunk first, then allocate data virtual chunks to get real + * estimation. + * But that needs chunk_mutex, which could be very slow to accquire. + * + * So here we trade for non-blocking statfs. And meta over-committing is + * less a problem because: + * - Meta over-commit only happens when we have unallocated space + * So no over-commit if we're low on available space. + * + * This may not be as accurate as virtual chunk based one, but it + * should be good enough for statfs usage. + */ + if (free_meta < meta_rsv) { + u64 meta_needed = meta_rsv - free_meta; + int data_factor = btrfs_bg_type_to_factor(data_profile); + int meta_factor = btrfs_bg_type_to_factor( + btrfs_metadata_alloc_profile(fs_info)); + + if (data_avail < meta_needed * meta_factor / data_factor) + data_avail = 0; + else + data_avail -= meta_needed * meta_factor / data_factor; } - - kfree(devices_info); - *free_bytes = avail_space; - return 0; + return data_avail; } /* @@ -2016,10 +1951,13 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, * * Unused device space usage is based on simulating the chunk allocator * algorithm that respects the device sizes and order of allocations. This is - * a close approximation of the actual use but there are other factors that may - * change the result (like a new metadata chunk). + * a very close approximation of the actual use. + * There are some inaccurate accounting for metadata over-commit, but it + * shouldn't cause big difference. * - * If metadata is exhausted, f_bavail will be 0. + * There should be no way to exhaust metadata so that we can't even delete + * files. Thus the old "exhaused metadata means 0 f_bavail" behavior will be + * discard. */ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) { @@ -2033,8 +1971,6 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) __be32 *fsid = (__be32 *)fs_info->fs_devices->fsid; unsigned factor = 1; struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; - int ret; - u64 thresh = 0; int mixed = 0; rcu_read_lock(); @@ -2082,31 +2018,9 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) buf->f_bfree = 0; spin_unlock(&block_rsv->lock); - buf->f_bavail = div_u64(total_free_data, factor); - ret = btrfs_calc_avail_data_space(fs_info, &total_free_data); - if (ret) - return ret; - buf->f_bavail += div_u64(total_free_data, factor); + buf->f_bavail = btrfs_calc_avail_data_space(fs_info, total_free_meta); + buf->f_bavail += total_free_data; buf->f_bavail = buf->f_bavail >> bits; - - /* - * We calculate the remaining metadata space minus global reserve. If - * this is (supposedly) smaller than zero, there's no space. But this - * does not hold in practice, the exhausted state happens where's still - * some positive delta. So we apply some guesswork and compare the - * delta to a 4M threshold. (Practically observed delta was ~2M.) - * - * We probably cannot calculate the exact threshold value because this - * depends on the internal reservations requested by various - * operations, so some operations that consume a few metadata will - * succeed even if the Avail is zero. But this is better than the other - * way around. - */ - thresh = SZ_4M; - - if (!mixed && total_free_meta - thresh < block_rsv->size) - buf->f_bavail = 0; - buf->f_type = BTRFS_SUPER_MAGIC; buf->f_bsize = dentry->d_sb->s_blocksize; buf->f_namelen = BTRFS_NAME_LEN; From patchwork Thu Jan 16 06:04:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11336139 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 376FE92A for ; Thu, 16 Jan 2020 06:04:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 171FD207E0 for ; Thu, 16 Jan 2020 06:04:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729792AbgAPGE0 (ORCPT ); Thu, 16 Jan 2020 01:04:26 -0500 Received: from mx2.suse.de ([195.135.220.15]:45194 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729718AbgAPGE0 (ORCPT ); Thu, 16 Jan 2020 01:04:26 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 39DDCB1A3 for ; Thu, 16 Jan 2020 06:04:24 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH v6 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() Date: Thu, 16 Jan 2020 14:04:03 +0800 Message-Id: <20200116060404.95200-5-wqu@suse.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200116060404.95200-1-wqu@suse.com> References: <20200116060404.95200-1-wqu@suse.com> MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org When btrfs_update_device() failed due to ENOMEM, we didn't reset device size back to its original size, causing the in-memory device size larger than original. If somehow the memory pressure get solved, and the fs committed, since the device item is not updated, but super block total size get updated, it would cause mount failure due to size mismatch. So here revert device size and super size to its original size when btrfs_update_device() failed, just like what we did in shrink_device(). Signed-off-by: Qu Wenruo --- fs/btrfs/volumes.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index b899c8b0d0ee..6ba66c7cff2e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2805,6 +2805,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans, { struct btrfs_fs_info *fs_info = device->fs_info; struct btrfs_super_block *super_copy = fs_info->super_copy; + u64 old_device_size; u64 old_total; u64 diff; int ret; @@ -2815,6 +2816,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans, new_size = round_down(new_size, fs_info->sectorsize); mutex_lock(&fs_info->chunk_mutex); + old_device_size = device->total_bytes; old_total = btrfs_super_total_bytes(super_copy); diff = round_down(new_size - device->total_bytes, fs_info->sectorsize); @@ -2837,9 +2839,25 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans, ret = calc_per_profile_avail(fs_info); mutex_unlock(&fs_info->chunk_mutex); if (ret < 0) - return ret; + goto out; - return btrfs_update_device(trans, device); + ret = btrfs_update_device(trans, device); +out: + if (ret < 0) { + /* + * Although we dropped chunk_mutex halfway for + * btrfs_update_device(), we have FS_EXCL_OP bit to prevent + * shrinking/growing race. + * So we're safe to use the old size directly. + */ + mutex_lock(&fs_info->chunk_mutex); + btrfs_set_super_total_bytes(super_copy, old_total); + device->fs_devices->total_rw_bytes -= diff; + btrfs_device_set_total_bytes(device, old_device_size); + btrfs_device_set_disk_total_bytes(device, old_device_size); + mutex_unlock(&fs_info->chunk_mutex); + } + return ret; } static int btrfs_free_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset) From patchwork Thu Jan 16 06:04:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11336141 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 97BBD109A for ; Thu, 16 Jan 2020 06:04:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 81153207E0 for ; Thu, 16 Jan 2020 06:04:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729829AbgAPGE1 (ORCPT ); Thu, 16 Jan 2020 01:04:27 -0500 Received: from mx2.suse.de ([195.135.220.15]:45202 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729718AbgAPGE1 (ORCPT ); Thu, 16 Jan 2020 01:04:27 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 2CA6CB1A2 for ; Thu, 16 Jan 2020 06:04:26 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH v6 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed Date: Thu, 16 Jan 2020 14:04:04 +0800 Message-Id: <20200116060404.95200-6-wqu@suse.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200116060404.95200-1-wqu@suse.com> References: <20200116060404.95200-1-wqu@suse.com> MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org In __btrfs_alloc_chunk(), if calc_per_profile_avail() failed, it will not update the per-profile available space array. However since device used space is already updated, this would cause a mismatch between them. To fix this problem, this patch will revert device used bytes when calc_per_profile_avail() failed, and remove the newly allocated chunk. Signed-off-by: Qu Wenruo --- fs/btrfs/volumes.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 6ba66c7cff2e..749c647818aa 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5173,6 +5173,15 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, &trans->transaction->dev_update_list); } ret = calc_per_profile_avail(info); + if (ret < 0) { + for (i = 0; i < map->num_stripes; i++) { + struct btrfs_device *dev = map->stripes[i].dev; + + btrfs_device_set_bytes_used(dev, + dev->bytes_used - stripe_size); + } + goto error_del_extent; + } atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);