[v9,1/5] btrfs: Introduce per-profile available space facility

[PROBLEM]
There are some locations in btrfs requiring accurate estimation on how
many new bytes can be allocated on unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic purposed enough, can't
  handle things like RAID5/6.

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and chunk allocator is mostly on rounding and
[0, 1M) reserved space handling, which shouldn't cause practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As we always select the device with least free space, the device with
the most space will be the first to be utilized, just like chunk
allocator.
For above 1T + 10T device, we will allocate a 1T virtual chunk
in the first iteration, then run out of device in next iteration.

Thus only get 1T free space for RAID1 type, just like what chunk
allocator would do.

The patch will update such per-profile available space at the following
timing:
- Mount time
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink
- Device add
- Device removal

Those timing are all protected by chunk_mutex, and what we do are only
iterating in-memory only structures, no extra IO triggered, so the
performance impact should be pretty small.

For the extra error handling, the principle is to keep the old behavior.
That's to say, if old error handler would just return an error, then we
follow it, no matter if the caller reverts the device size.

For the proper error handling, they will be added in later patches.
As I don't want to make the core facility bloated by the error handling
code, especially some situation needs quite some new code to handle
errors.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 213 +++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h |  10 +++
 2 files changed, 204 insertions(+), 19 deletions(-)

Message ID	20200602090905.63899-2-wqu@suse.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=47n9=7P=vger.kernel.org=linux-btrfs-owner@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 57E85912 for <patchwork-linux-btrfs@patchwork.kernel.org>; Tue, 2 Jun 2020 09:09:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3C6A62077D for <patchwork-linux-btrfs@patchwork.kernel.org>; Tue, 2 Jun 2020 09:09:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727977AbgFBJJR (ORCPT <rfc822;patchwork-linux-btrfs@patchwork.kernel.org>); Tue, 2 Jun 2020 05:09:17 -0400 Received: from mx2.suse.de ([195.135.220.15]:42960 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726160AbgFBJJQ (ORCPT <rfc822;linux-btrfs@vger.kernel.org>); Tue, 2 Jun 2020 05:09:16 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id D665AAC49; Tue, 2 Jun 2020 09:09:15 +0000 (UTC) From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org Cc: Josef Bacik <josef@toxicpanda.com> Subject: [PATCH v9 1/5] btrfs: Introduce per-profile available space facility Date: Tue, 2 Jun 2020 17:09:01 +0800 Message-Id: <20200602090905.63899-2-wqu@suse.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200602090905.63899-1-wqu@suse.com> References: <20200602090905.63899-1-wqu@suse.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: <linux-btrfs.vger.kernel.org> X-Mailing-List: linux-btrfs@vger.kernel.org
Series	Introduce per-profile available space array to avoid over-confident can_overcommit() \| expand [v9,0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() [v9,1/5] btrfs: Introduce per-profile available space facility [v9,2/5] btrfs: space-info: Use per-profile available space in can_overcommit() [v9,3/5] btrfs: statfs: Use pre-calculated per-profile available space [v9,4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() [v9,5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed

[v9,1/5] btrfs: Introduce per-profile available space facility

Commit Message

Patch