btrfs: relocation: Allow 'btrfs balance cancel' to return quicker

Message ID 20191202070235.33099-1-wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: relocation: Allow 'btrfs balance cancel' to return quicker

Commit Message

Qu Wenruo Dec. 2, 2019, 7:02 a.m. UTC
[PROBLEM]
Quite a few users have reported that 'btrfs balance cancel' is slow to
cancel the currently running balance, or that it doesn't work at all for
certain dead balance loops.

The following script shows how long it takes to fully stop a balance:
  #!/bin/bash
  dev=/dev/test/test
  mnt=/mnt/btrfs

  umount $mnt &> /dev/null
  umount $dev &> /dev/null

  mkfs.btrfs -f $dev
  mount $dev -o nospace_cache $mnt

  dd if=/dev/zero bs=1M of=$mnt/large &
  dd_pid=$!

  sleep 3
  kill -KILL $dd_pid
  sync

  btrfs balance start --bg --full $mnt &
  sleep 1

  echo "cancel request" >> /dev/kmsg
  time btrfs balance cancel $mnt
  umount $mnt

It takes around 7~10s to cancel the running balance in my test
environment.

[CAUSE]
Btrfs uses btrfs_fs_info::balance_cancel_req to record how many cancel
requests are queued.

Btrfs checks this value only at the following call sites:
btrfs_balance()
|- atomic_read(&fs_info->balance_cancel_req); <<< 1
|- __btrfs_balance()
   |- while (1) {
   |  /* Per chunk iteration */
   |-    atomic_read(&fs_info->balance_cancel_req); <<< 2

The first check is nearly useless, as it happens at the very beginning of
the balance, so it is rarely hit.

The second check is the one most commonly hit, but it is too slow, as it
is only reached after each chunk gets relocated.

For certain bug reports, like the "Found 1 extents" loop where we are
dead-looping inside btrfs_relocate_block_group(), it is useless.

[FIX]
This patch introduces more cancel checks at the following call sites:
btrfs_balance()
|- __btrfs_balance()
   |- btrfs_relocate_block_group()
      |- while (1) { /* Per relocation-stage loop, 2~3 runs */
      |-    should_cancel_balance()	<<< 1
      |-    relocate_block_group()
      |- }

/* Call site 1 works around the dead balance loop */
Call site 1 allows the user to work around the mentioned dead balance
loop by properly canceling it.

relocate_block_group()
|- while (1) { /* Per-extent iteration */
|-    relocate_data_extent()
|     |- relocate_file_extent_cluster()
|        |- should_cancel_balance()	<<< 2
|-    should_cancel_balance()		<<< 3
|- }
|- relocate_file_extent_cluster()

/* Call site 2 is for data-heavy relocation */
As we spend a lot of time reading pages during data relocation, this
check makes the exit much quicker for data relocation.
The check has a bytes-based filter (every 32M) to avoid wasting too
much CPU time on the check itself.

/* Call site 3 is for metadata-heavy relocation */
This check has an nr_extents-based filter (every 256 extents) to avoid
wasting too much CPU time.

/* Error injection for a full coverage test */
This patch wraps the regular atomic_read() in a separate function,
should_cancel_balance(), to allow error injection, so that all cancel
exit branches can be covered in testing.
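
Such a coverage run could then be driven from user space through the
fail_function fault-injection interface. This is only a rough sketch, on
the assumption that CONFIG_FUNCTION_ERROR_INJECTION and CONFIG_FAIL_FUNCTION
are enabled and that the standard debugfs knobs described in
Documentation/fault-injection apply; the exact paths may differ:
  #!/bin/bash
  # Force should_cancel_balance() to "fail" (return non-zero) so that
  # every cancel exit branch gets exercised during a balance run.
  cd /sys/kernel/debug/fail_function
  echo should_cancel_balance > inject
  echo 1 > should_cancel_balance/retval
  echo 100 > probability      # inject on every call
  echo -1 > times             # no limit on the number of injections
  # Now run a balance (e.g. the reproducer above); it should exit
  # with -ECANCELED almost immediately.
  echo > inject               # clear the injection when done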

With this patch, the response time is reduced from 7~10s to 0.5~1.5s for
data relocation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h      |  1 +
 fs/btrfs/relocation.c | 41 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c    |  6 +++---
 3 files changed, 45 insertions(+), 3 deletions(-)

Comments

Filipe Manana Dec. 2, 2019, 12:31 p.m. UTC | #1
On Mon, Dec 2, 2019 at 7:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [PROBLEM]
> Quite a few users have reported that 'btrfs balance cancel' is slow to
> cancel the currently running balance, or that it doesn't work at all for
> certain dead balance loops.
>
> The following script shows how long it takes to fully stop a balance:
>   #!/bin/bash
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
>
>   umount $mnt &> /dev/null
>   umount $dev &> /dev/null
>
>   mkfs.btrfs -f $dev
>   mount $dev -o nospace_cache $mnt
>
>   dd if=/dev/zero bs=1M of=$mnt/large &
>   dd_pid=$!
>
>   sleep 3
>   kill -KILL $dd_pid
>   sync
>
>   btrfs balance start --bg --full $mnt &
>   sleep 1
>
>   echo "cancel request" >> /dev/kmsg
>   time btrfs balance cancel $mnt
>   umount $mnt
>
> It takes around 7~10s to cancel the running balance in my test
> environment.
>
> [CAUSE]
> Btrfs uses btrfs_fs_info::balance_cancel_req to record how many cancel
> requests are queued.
>
> Btrfs checks this value only at the following call sites:
> btrfs_balance()
> |- atomic_read(&fs_info->balance_cancel_req); <<< 1
> |- __btrfs_balance()
>    |- while (1) {
>    |  /* Per chunk iteration */
>    |-    atomic_read(&fs_info->balance_cancel_req); <<< 2
>
> The first check is nearly useless, as it happens at the very beginning of
> the balance, so it is rarely hit.
>
> The second check is the one most commonly hit, but it is too slow, as it
> is only reached after each chunk gets relocated.
>
> For certain bug reports, like the "Found 1 extents" loop where we are
> dead-looping inside btrfs_relocate_block_group(), it is useless.
>
> [FIX]
> This patch introduces more cancel checks at the following call sites:
> btrfs_balance()
> |- __btrfs_balance()
>    |- btrfs_relocate_block_group()
>       |- while (1) { /* Per relocation-stage loop, 2~3 runs */
>       |-    should_cancel_balance()     <<< 1
>       |-    relocate_block_group()
>       |- }
>
> /* Call site 1 works around the dead balance loop */
> Call site 1 allows the user to work around the mentioned dead balance
> loop by properly canceling it.
>
> relocate_block_group()
> |- while (1) { /* Per-extent iteration */
> |-    relocate_data_extent()
> |     |- relocate_file_extent_cluster()
> |        |- should_cancel_balance()     <<< 2
> |-    should_cancel_balance()           <<< 3
> |- }
> |- relocate_file_extent_cluster()
>
> /* Call site 2 is for data-heavy relocation */
> As we spend a lot of time reading pages during data relocation, this
> check makes the exit much quicker for data relocation.
> The check has a bytes-based filter (every 32M) to avoid wasting too
> much CPU time on the check itself.

Do you really think (or have you observed) that reading an atomic is that
CPU intensive?

Given the context where this is used, I would say keep it simple
and do the check after each page -
the amount of work we do for each page is at least an order of
magnitude heavier than reading an atomic.
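
For illustration, the simplified version could just do the check at the end
of each per-page iteration in relocate_file_extent_cluster(), with no byte
counter at all (a sketch of the suggestion, not code from this patch):
  /* After btrfs_throttle(fs_info) in the per-page loop: */
  if (should_cancel_balance(fs_info)) {
          ret = -ECANCELED;
          goto out;
  }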

>
> /* Call site 3 is for metadata-heavy relocation */
> This check has an nr_extents-based filter (every 256 extents) to avoid
> wasting too much CPU time.

Same comment as before.

>
> /* Error injection for a full coverage test */
> This patch wraps the regular atomic_read() in a separate function,
> should_cancel_balance(), to allow error injection, so that all cancel
> exit branches can be covered in testing.

I suppose I would do that separately (as in a separate patch). Not
sure if it's that useful to keep it, despite it probably having been
useful for your testing/debugging.
Anyway, that may very well be subjective.

Other than that it looks good to me.
Thanks.

>
> With this patch, the response time is reduced from 7~10s to 0.5~1.5s for
> data relocation.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ctree.h      |  1 +
>  fs/btrfs/relocation.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c    |  6 +++---
>  3 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b2e8fd8a8e59..a712c18d2da2 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3352,6 +3352,7 @@ void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
>                               u64 *bytes_to_reserve);
>  int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>                               struct btrfs_pending_snapshot *pending);
> +int should_cancel_balance(struct btrfs_fs_info *fs_info);
>
>  /* scrub.c */
>  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..c42616750e4b 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -9,6 +9,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/rbtree.h>
>  #include <linux/slab.h>
> +#include <linux/error-injection.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -3223,6 +3224,16 @@ int setup_extent_mapping(struct inode *inode, u64 start, u64 end,
>         return ret;
>  }
>
> +int should_cancel_balance(struct btrfs_fs_info *fs_info)
> +{
> +       return atomic_read(&fs_info->balance_cancel_req);
> +}
> +/* Allow us to do error injection test to cover all cancel exit branches */
> +ALLOW_ERROR_INJECTION(should_cancel_balance, TRUE);
> +
> +/* Thresholds of when to check if the balance is canceled */
> +#define RELOC_CHECK_INTERVAL_NR_EXTENTS                (256)
> +#define RELOC_CHECK_INTERVAL_BYTES             (SZ_32M)
>  static int relocate_file_extent_cluster(struct inode *inode,
>                                         struct file_extent_cluster *cluster)
>  {
> @@ -3230,6 +3241,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
>         u64 page_start;
>         u64 page_end;
>         u64 offset = BTRFS_I(inode)->index_cnt;
> +       u64 checked_bytes = 0;
>         unsigned long index;
>         unsigned long last_index;
>         struct page *page;
> @@ -3344,6 +3356,14 @@ static int relocate_file_extent_cluster(struct inode *inode,
>                 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
>                 balance_dirty_pages_ratelimited(inode->i_mapping);
>                 btrfs_throttle(fs_info);
> +
> +               checked_bytes += PAGE_SIZE;
> +               if (checked_bytes >= RELOC_CHECK_INTERVAL_BYTES &&
> +                   should_cancel_balance(fs_info)) {
> +                       ret = -ECANCELED;
> +                       goto out;
> +               }
> +               checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
>         }
>         WARN_ON(nr != cluster->nr);
>  out:
> @@ -4016,7 +4036,10 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>         struct btrfs_path *path;
>         struct btrfs_extent_item *ei;
>         u64 flags;
> +       u64 checked_bytes = 0;
> +       u64 checked_nr_extents = 0;
>         u32 item_size;
> +       u32 extent_size;
>         int ret;
>         int err = 0;
>         int progress = 0;
> @@ -4080,11 +4103,14 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>                 }
>
>                 if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> +                       extent_size = fs_info->nodesize;
>                         ret = add_tree_block(rc, &key, path, &blocks);
>                 } else if (rc->stage == UPDATE_DATA_PTRS &&
>                            (flags & BTRFS_EXTENT_FLAG_DATA)) {
> +                       extent_size = key.offset;
>                         ret = add_data_references(rc, &key, path, &blocks);
>                 } else {
> +                       extent_size = key.offset;
>                         btrfs_release_path(path);
>                         ret = 0;
>                 }
> @@ -4125,6 +4151,17 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>                                 break;
>                         }
>                 }
> +               checked_bytes += extent_size;
> +               checked_nr_extents++;
> +
> +               if ((checked_bytes >= RELOC_CHECK_INTERVAL_BYTES ||
> +                    checked_nr_extents >= RELOC_CHECK_INTERVAL_NR_EXTENTS) &&
> +                   should_cancel_balance(fs_info)) {
> +                       err = -ECANCELED;
> +                       break;
> +               }
> +               checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
> +               checked_nr_extents %= RELOC_CHECK_INTERVAL_NR_EXTENTS;
>         }
>         if (trans && progress && err == -ENOSPC) {
>                 ret = btrfs_force_chunk_alloc(trans, rc->block_group->flags);
> @@ -4365,6 +4402,10 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>                                  rc->block_group->length);
>
>         while (1) {
> +               if (should_cancel_balance(fs_info)) {
> +                       err= -ECANCELED;
> +                       goto out;
> +               }
>                 mutex_lock(&fs_info->cleaner_mutex);
>                 ret = relocate_block_group(rc);
>                 mutex_unlock(&fs_info->cleaner_mutex);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d8e5560db285..afa3ed1b066d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3505,7 +3505,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>
>         while (1) {
>                 if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
> -                   atomic_read(&fs_info->balance_cancel_req)) {
> +                   should_cancel_balance(fs_info)) {
>                         ret = -ECANCELED;
>                         goto error;
>                 }
> @@ -3670,7 +3670,7 @@ static int alloc_profile_is_valid(u64 flags, int extended)
>  static inline int balance_need_close(struct btrfs_fs_info *fs_info)
>  {
>         /* cancel requested || normal exit path */
> -       return atomic_read(&fs_info->balance_cancel_req) ||
> +       return should_cancel_balance(fs_info) ||
>                 (atomic_read(&fs_info->balance_pause_req) == 0 &&
>                  atomic_read(&fs_info->balance_cancel_req) == 0);
>  }
> @@ -3856,7 +3856,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>
>         if (btrfs_fs_closing(fs_info) ||
>             atomic_read(&fs_info->balance_pause_req) ||
> -           atomic_read(&fs_info->balance_cancel_req)) {
> +           should_cancel_balance(fs_info)) {
>                 ret = -EINVAL;
>                 goto out;
>         }
> --
> 2.24.0
>
Qu Wenruo Dec. 2, 2019, 1:50 p.m. UTC | #2
On 2019/12/2 8:31 PM, Filipe Manana wrote:
> On Mon, Dec 2, 2019 at 7:04 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [PROBLEM]
>> Quite a few users have reported that 'btrfs balance cancel' is slow to
>> cancel the currently running balance, or that it doesn't work at all for
>> certain dead balance loops.
>>
>> The following script shows how long it takes to fully stop a balance:
>>   #!/bin/bash
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   umount $mnt &> /dev/null
>>   umount $dev &> /dev/null
>>
>>   mkfs.btrfs -f $dev
>>   mount $dev -o nospace_cache $mnt
>>
>>   dd if=/dev/zero bs=1M of=$mnt/large &
>>   dd_pid=$!
>>
>>   sleep 3
>>   kill -KILL $dd_pid
>>   sync
>>
>>   btrfs balance start --bg --full $mnt &
>>   sleep 1
>>
>>   echo "cancel request" >> /dev/kmsg
>>   time btrfs balance cancel $mnt
>>   umount $mnt
>>
>> It takes around 7~10s to cancel the running balance in my test
>> environment.
>>
>> [CAUSE]
>> Btrfs uses btrfs_fs_info::balance_cancel_req to record how many cancel
>> requests are queued.
>>
>> Btrfs checks this value only at the following call sites:
>> btrfs_balance()
>> |- atomic_read(&fs_info->balance_cancel_req); <<< 1
>> |- __btrfs_balance()
>>    |- while (1) {
>>    |  /* Per chunk iteration */
>>    |-    atomic_read(&fs_info->balance_cancel_req); <<< 2
>>
>> The first check is nearly useless, as it happens at the very beginning of
>> the balance, so it is rarely hit.
>>
>> The second check is the one most commonly hit, but it is too slow, as it
>> is only reached after each chunk gets relocated.
>>
>> For certain bug reports, like the "Found 1 extents" loop where we are
>> dead-looping inside btrfs_relocate_block_group(), it is useless.
>>
>> [FIX]
>> This patch introduces more cancel checks at the following call sites:
>> btrfs_balance()
>> |- __btrfs_balance()
>>    |- btrfs_relocate_block_group()
>>       |- while (1) { /* Per relocation-stage loop, 2~3 runs */
>>       |-    should_cancel_balance()     <<< 1
>>       |-    relocate_block_group()
>>       |- }
>>
>> /* Call site 1 works around the dead balance loop */
>> Call site 1 allows the user to work around the mentioned dead balance
>> loop by properly canceling it.
>>
>> relocate_block_group()
>> |- while (1) { /* Per-extent iteration */
>> |-    relocate_data_extent()
>> |     |- relocate_file_extent_cluster()
>> |        |- should_cancel_balance()     <<< 2
>> |-    should_cancel_balance()           <<< 3
>> |- }
>> |- relocate_file_extent_cluster()
>>
>> /* Call site 2 is for data-heavy relocation */
>> As we spend a lot of time reading pages during data relocation, this
>> check makes the exit much quicker for data relocation.
>> The check has a bytes-based filter (every 32M) to avoid wasting too
>> much CPU time on the check itself.
> 
> Do you really think (or have you observed) that reading an atomic is that
> CPU intensive?
> 
> Given the context where this is used, I would say keep it simple
> and do the check after each page -
> the amount of work we do for each page is at least an order of
> magnitude heavier than reading an atomic.

You're right, I'm over-complicating the situation.

Keeping it simple is much better.

> 
>>
>> /* Call site 3 is for metadata-heavy relocation */
>> This check has an nr_extents-based filter (every 256 extents) to avoid
>> wasting too much CPU time.
> 
> Same comment as before.
> 
>>
>> /* Error injection for a full coverage test */
>> This patch wraps the regular atomic_read() in a separate function,
>> should_cancel_balance(), to allow error injection, so that all cancel
>> exit branches can be covered in testing.
> 
> I suppose I would do that separately (as in a separate patch). Not
> sure if it's that useful to keep it, despite it probably having been
> useful for your testing/debugging.

It looks better to separate it.

The usefulness only shows up when something crashes, and there are
locations, like merge_reloc_root(), where adding such a check would make
it crash like crazy.

I guess that crash will be solved in a separate patchset.

I'll update the patchset to reflect the comment.

Thanks for your review,
Qu

> Anyway, that may very well be subjective.
> 
> Other than that it looks good to me.
> Thanks.
> 
>>
>> With this patch, the response time is reduced from 7~10s to 0.5~1.5s for
>> data relocation.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/ctree.h      |  1 +
>>  fs/btrfs/relocation.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/volumes.c    |  6 +++---
>>  3 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index b2e8fd8a8e59..a712c18d2da2 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3352,6 +3352,7 @@ void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
>>                               u64 *bytes_to_reserve);
>>  int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>>                               struct btrfs_pending_snapshot *pending);
>> +int should_cancel_balance(struct btrfs_fs_info *fs_info);
>>
>>  /* scrub.c */
>>  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index d897a8e5e430..c42616750e4b 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -9,6 +9,7 @@
>>  #include <linux/blkdev.h>
>>  #include <linux/rbtree.h>
>>  #include <linux/slab.h>
>> +#include <linux/error-injection.h>
>>  #include "ctree.h"
>>  #include "disk-io.h"
>>  #include "transaction.h"
>> @@ -3223,6 +3224,16 @@ int setup_extent_mapping(struct inode *inode, u64 start, u64 end,
>>         return ret;
>>  }
>>
>> +int should_cancel_balance(struct btrfs_fs_info *fs_info)
>> +{
>> +       return atomic_read(&fs_info->balance_cancel_req);
>> +}
>> +/* Allow us to do error injection test to cover all cancel exit branches */
>> +ALLOW_ERROR_INJECTION(should_cancel_balance, TRUE);
>> +
>> +/* Thresholds of when to check if the balance is canceled */
>> +#define RELOC_CHECK_INTERVAL_NR_EXTENTS                (256)
>> +#define RELOC_CHECK_INTERVAL_BYTES             (SZ_32M)
>>  static int relocate_file_extent_cluster(struct inode *inode,
>>                                         struct file_extent_cluster *cluster)
>>  {
>> @@ -3230,6 +3241,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
>>         u64 page_start;
>>         u64 page_end;
>>         u64 offset = BTRFS_I(inode)->index_cnt;
>> +       u64 checked_bytes = 0;
>>         unsigned long index;
>>         unsigned long last_index;
>>         struct page *page;
>> @@ -3344,6 +3356,14 @@ static int relocate_file_extent_cluster(struct inode *inode,
>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
>>                 balance_dirty_pages_ratelimited(inode->i_mapping);
>>                 btrfs_throttle(fs_info);
>> +
>> +               checked_bytes += PAGE_SIZE;
>> +               if (checked_bytes >= RELOC_CHECK_INTERVAL_BYTES &&
>> +                   should_cancel_balance(fs_info)) {
>> +                       ret = -ECANCELED;
>> +                       goto out;
>> +               }
>> +               checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
>>         }
>>         WARN_ON(nr != cluster->nr);
>>  out:
>> @@ -4016,7 +4036,10 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>>         struct btrfs_path *path;
>>         struct btrfs_extent_item *ei;
>>         u64 flags;
>> +       u64 checked_bytes = 0;
>> +       u64 checked_nr_extents = 0;
>>         u32 item_size;
>> +       u32 extent_size;
>>         int ret;
>>         int err = 0;
>>         int progress = 0;
>> @@ -4080,11 +4103,14 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>>                 }
>>
>>                 if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
>> +                       extent_size = fs_info->nodesize;
>>                         ret = add_tree_block(rc, &key, path, &blocks);
>>                 } else if (rc->stage == UPDATE_DATA_PTRS &&
>>                            (flags & BTRFS_EXTENT_FLAG_DATA)) {
>> +                       extent_size = key.offset;
>>                         ret = add_data_references(rc, &key, path, &blocks);
>>                 } else {
>> +                       extent_size = key.offset;
>>                         btrfs_release_path(path);
>>                         ret = 0;
>>                 }
>> @@ -4125,6 +4151,17 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>>                                 break;
>>                         }
>>                 }
>> +               checked_bytes += extent_size;
>> +               checked_nr_extents++;
>> +
>> +               if ((checked_bytes >= RELOC_CHECK_INTERVAL_BYTES ||
>> +                    checked_nr_extents >= RELOC_CHECK_INTERVAL_NR_EXTENTS) &&
>> +                   should_cancel_balance(fs_info)) {
>> +                       err = -ECANCELED;
>> +                       break;
>> +               }
>> +               checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
>> +               checked_nr_extents %= RELOC_CHECK_INTERVAL_NR_EXTENTS;
>>         }
>>         if (trans && progress && err == -ENOSPC) {
>>                 ret = btrfs_force_chunk_alloc(trans, rc->block_group->flags);
>> @@ -4365,6 +4402,10 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
>>                                  rc->block_group->length);
>>
>>         while (1) {
>> +               if (should_cancel_balance(fs_info)) {
>> +                       err= -ECANCELED;
>> +                       goto out;
>> +               }
>>                 mutex_lock(&fs_info->cleaner_mutex);
>>                 ret = relocate_block_group(rc);
>>                 mutex_unlock(&fs_info->cleaner_mutex);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d8e5560db285..afa3ed1b066d 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3505,7 +3505,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>>
>>         while (1) {
>>                 if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
>> -                   atomic_read(&fs_info->balance_cancel_req)) {
>> +                   should_cancel_balance(fs_info)) {
>>                         ret = -ECANCELED;
>>                         goto error;
>>                 }
>> @@ -3670,7 +3670,7 @@ static int alloc_profile_is_valid(u64 flags, int extended)
>>  static inline int balance_need_close(struct btrfs_fs_info *fs_info)
>>  {
>>         /* cancel requested || normal exit path */
>> -       return atomic_read(&fs_info->balance_cancel_req) ||
>> +       return should_cancel_balance(fs_info) ||
>>                 (atomic_read(&fs_info->balance_pause_req) == 0 &&
>>                  atomic_read(&fs_info->balance_cancel_req) == 0);
>>  }
>> @@ -3856,7 +3856,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>>
>>         if (btrfs_fs_closing(fs_info) ||
>>             atomic_read(&fs_info->balance_pause_req) ||
>> -           atomic_read(&fs_info->balance_cancel_req)) {
>> +           should_cancel_balance(fs_info)) {
>>                 ret = -EINVAL;
>>                 goto out;
>>         }
>> --
>> 2.24.0
>>
> 
>

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b2e8fd8a8e59..a712c18d2da2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3352,6 +3352,7 @@  void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
 			      u64 *bytes_to_reserve);
 int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 			      struct btrfs_pending_snapshot *pending);
+int should_cancel_balance(struct btrfs_fs_info *fs_info);
 
 /* scrub.c */
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..c42616750e4b 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -9,6 +9,7 @@ 
 #include <linux/blkdev.h>
 #include <linux/rbtree.h>
 #include <linux/slab.h>
+#include <linux/error-injection.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3223,6 +3224,16 @@  int setup_extent_mapping(struct inode *inode, u64 start, u64 end,
 	return ret;
 }
 
+int should_cancel_balance(struct btrfs_fs_info *fs_info)
+{
+	return atomic_read(&fs_info->balance_cancel_req);
+}
+/* Allow us to do error injection test to cover all cancel exit branches */
+ALLOW_ERROR_INJECTION(should_cancel_balance, TRUE);
+
+/* Thresholds of when to check if the balance is canceled */
+#define RELOC_CHECK_INTERVAL_NR_EXTENTS		(256)
+#define RELOC_CHECK_INTERVAL_BYTES		(SZ_32M)
 static int relocate_file_extent_cluster(struct inode *inode,
 					struct file_extent_cluster *cluster)
 {
@@ -3230,6 +3241,7 @@  static int relocate_file_extent_cluster(struct inode *inode,
 	u64 page_start;
 	u64 page_end;
 	u64 offset = BTRFS_I(inode)->index_cnt;
+	u64 checked_bytes = 0;
 	unsigned long index;
 	unsigned long last_index;
 	struct page *page;
@@ -3344,6 +3356,14 @@  static int relocate_file_extent_cluster(struct inode *inode,
 		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 		balance_dirty_pages_ratelimited(inode->i_mapping);
 		btrfs_throttle(fs_info);
+
+		checked_bytes += PAGE_SIZE;
+		if (checked_bytes >= RELOC_CHECK_INTERVAL_BYTES &&
+		    should_cancel_balance(fs_info)) {
+			ret = -ECANCELED;
+			goto out;
+		}
+		checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
 	}
 	WARN_ON(nr != cluster->nr);
 out:
@@ -4016,7 +4036,10 @@  static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	struct btrfs_path *path;
 	struct btrfs_extent_item *ei;
 	u64 flags;
+	u64 checked_bytes = 0;
+	u64 checked_nr_extents = 0;
 	u32 item_size;
+	u32 extent_size;
 	int ret;
 	int err = 0;
 	int progress = 0;
@@ -4080,11 +4103,14 @@  static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 		}
 
 		if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+			extent_size = fs_info->nodesize;
 			ret = add_tree_block(rc, &key, path, &blocks);
 		} else if (rc->stage == UPDATE_DATA_PTRS &&
 			   (flags & BTRFS_EXTENT_FLAG_DATA)) {
+			extent_size = key.offset;
 			ret = add_data_references(rc, &key, path, &blocks);
 		} else {
+			extent_size = key.offset;
 			btrfs_release_path(path);
 			ret = 0;
 		}
@@ -4125,6 +4151,17 @@  static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 				break;
 			}
 		}
+		checked_bytes += extent_size;
+		checked_nr_extents++;
+
+		if ((checked_bytes >= RELOC_CHECK_INTERVAL_BYTES ||
+		     checked_nr_extents >= RELOC_CHECK_INTERVAL_NR_EXTENTS) &&
+		    should_cancel_balance(fs_info)) {
+			err = -ECANCELED;
+			break;
+		}
+		checked_bytes %= RELOC_CHECK_INTERVAL_BYTES;
+		checked_nr_extents %= RELOC_CHECK_INTERVAL_NR_EXTENTS;
 	}
 	if (trans && progress && err == -ENOSPC) {
 		ret = btrfs_force_chunk_alloc(trans, rc->block_group->flags);
@@ -4365,6 +4402,10 @@  int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 				 rc->block_group->length);
 
 	while (1) {
+		if (should_cancel_balance(fs_info)) {
+			err = -ECANCELED;
+			goto out;
+		}
 		mutex_lock(&fs_info->cleaner_mutex);
 		ret = relocate_block_group(rc);
 		mutex_unlock(&fs_info->cleaner_mutex);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..afa3ed1b066d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3505,7 +3505,7 @@  static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 
 	while (1) {
 		if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
-		    atomic_read(&fs_info->balance_cancel_req)) {
+		    should_cancel_balance(fs_info)) {
 			ret = -ECANCELED;
 			goto error;
 		}
@@ -3670,7 +3670,7 @@  static int alloc_profile_is_valid(u64 flags, int extended)
 static inline int balance_need_close(struct btrfs_fs_info *fs_info)
 {
 	/* cancel requested || normal exit path */
-	return atomic_read(&fs_info->balance_cancel_req) ||
+	return should_cancel_balance(fs_info) ||
 		(atomic_read(&fs_info->balance_pause_req) == 0 &&
 		 atomic_read(&fs_info->balance_cancel_req) == 0);
 }
@@ -3856,7 +3856,7 @@  int btrfs_balance(struct btrfs_fs_info *fs_info,
 
 	if (btrfs_fs_closing(fs_info) ||
 	    atomic_read(&fs_info->balance_pause_req) ||
-	    atomic_read(&fs_info->balance_cancel_req)) {
+	    should_cancel_balance(fs_info)) {
 		ret = -EINVAL;
 		goto out;
 	}