
[v3,0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()

Message ID 20200106061343.18772-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Jan. 6, 2020, 6:13 a.m. UTC
There are several bug reports of ENOSPC error in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() is using an incorrect method to calculate allocatable
metadata space.

The most typical case looks like this:
  devid 1 unallocated:	1G
  devid 2 unallocated:	10G
  metadata profile:	RAID1

In the above case, we can allocate at most a 1G chunk for metadata, due
to the unbalanced free space across devices.
But the current can_overcommit() uses a factor-based calculation, which
never considers how free space is distributed among devices.


To address this problem, this series introduces a per-profile available
space array, which gets updated every time a chunk is allocated or
removed, or a device is grown or shrunk.

This provides a quick way for hot paths like can_overcommit() to grab
an estimate of how many bytes can be over-committed.

The per-profile available space calculation mimics the behavior of the
chunk allocator, thus it handles uneven disks pretty well.

Although the per-profile array is not clever enough to handle the
estimation when both data and metadata chunks need to be considered, its
virtual chunk infrastructure is flexible enough to handle such cases.

So for statfs(), we also reuse the virtual chunk allocator to calculate
available data space, with metadata over-commit space taken into account.
This brings an unexpected advantage: now we can handle RAID5/6 pretty
well in statfs().

The execution time of this per-profile calculation is a little below
20us per 5 iterations in my test VM.
Although every such calculation needs to acquire the chunk mutex, the
impact should be small enough.
For the full statfs() execution time analysis, please see the commit
message of the last patch.

Changelog:
v1:
- Fix a bug where we forgot to update the per-profile array after
  allocating a chunk.
  To avoid an ABBA deadlock, this introduces a small window at the end
  of __btrfs_alloc_chunk(); it's not elegant but should be good enough
  until we rework the chunk and device list mutexes.
  
- Make statfs() use the virtual chunk allocator to do a better
  estimation.
  Now statfs() not only reports a more accurate result, but also
  handles RAID5/6 better.

v2:
- Fix a deadlock caused by acquiring device_list_mutex inside
  __btrfs_alloc_chunk().
  There is no need to acquire device_list_mutex when already holding
  chunk_mutex.
  Fix it and remove the lockdep assert.

v3:
- Use the proper chunk_mutex instead of device_list_mutex.
  Since they protect two different things, and we only care about
  alloc_list, we should only use chunk_mutex.
  With the improved locking, it's easier to fold the
  calc_per_profile_available() calls into the first patch.

- Add performance benchmark for statfs() modification
  As Facebook seems to have run into some problems with statfs() calls,
  add some basic ftrace results.

Qu Wenruo (3):
  btrfs: Introduce per-profile available space facility
  btrfs: space-info: Use per-profile available space in can_overcommit()
  btrfs: statfs: Use virtual chunk allocation to calculation available
    data space

 fs/btrfs/space-info.c |  15 ++-
 fs/btrfs/super.c      | 190 +++++++++++++-----------------------
 fs/btrfs/volumes.c    | 218 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  15 +++
 4 files changed, 289 insertions(+), 149 deletions(-)

Comments

David Sterba Jan. 6, 2020, 2:06 p.m. UTC | #1
On Mon, Jan 06, 2020 at 02:13:40PM +0800, Qu Wenruo wrote:
> The execution time of this per-profile calculation is a little below
> 20 us per 5 iterations in my test VM.
> Although all such calculation will need to acquire chunk mutex, the
> impact should be small enough.

The problem is not only the execution time of statfs, but what happens
when the mutex is contended. This was the problem with the block group
mutex in the past that had to be converted to RCU.

If the chunk mutex gets locked because a new chunk is allocated, statfs
will block until it finishes. The time can vary a lot depending on the
workload, and a delay of seconds can trigger system monitor alerts.
Qu Wenruo Jan. 7, 2020, 2:04 a.m. UTC | #2
On 2020/1/6 10:06 PM, David Sterba wrote:
> On Mon, Jan 06, 2020 at 02:13:40PM +0800, Qu Wenruo wrote:
>> The execution time of this per-profile calculation is a little below
>> 20 us per 5 iterations in my test VM.
>> Although all such calculation will need to acquire chunk mutex, the
>> impact should be small enough.
> 
> The problem is not only the execution time of statfs, but what happens
> when the mutex is contended. This was the problem with the block group
> mutex in the past that had to be converted to RCU.
> 
> If the chunk mutex gets locked because a new chunk is allocated, statfs
> will block until it finishes. The time can vary a lot depending on the
> workload, and a delay of seconds can trigger system monitor alerts.
> 
Yes, that's exactly the same concern I have.

But I'm not sure how safe the old RCU implementation would be when
device->virtual_allocated is modified during the RCU critical section.

That is to say, if a virtual chunk is being allocated during statfs(),
we would get an incorrect result.
So I tend to keep it safe by protecting it with chunk_mutex, even if
that means chunk_mutex can block statfs().

Another solution is to completely forget the whole metadata part, and
just grab a spinlock and the pre-calculated result, but that may report
more available space than we really have.

If the delay is really a blocker, I can go with the pre-calculated way,
making the result a little less accurate.

Thanks,
Qu