From patchwork Thu Nov 19 07:28:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Qu Wenruo X-Patchwork-Id: 11916643 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2D4B6C2D0E4 for ; Thu, 19 Nov 2020 07:28:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B41F820658 for ; Thu, 19 Nov 2020 07:28:35 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=suse.com header.i=@suse.com header.b="pmLjdbNG" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726269AbgKSH2e (ORCPT ); Thu, 19 Nov 2020 02:28:34 -0500 Received: from mx2.suse.de ([195.135.220.15]:53448 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725843AbgKSH2e (ORCPT ); Thu, 19 Nov 2020 02:28:34 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1605770912; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=buVK0bp4q9/of8nGBsPpGgTV1gGTa87cMVbcCiYAdTI=; b=pmLjdbNGzfl/4GibUFjf7WyGtl+xcqKEKB2UlnKA2qgHx+3IGWe7frtYiYU76NGN3m/NRF 36qw95m6FAlGrwUkUcHEEz1zvwparZqBITHpJZqwOk83Bd1f4D2uwLv+PPpAnwyfsnGhMU G4wvZxcPmZv33EfsoZO2l6BiHI/A9ZA= Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 663E9AEA3 for ; Thu, 19 Nov 2020 07:28:32 +0000 (UTC) From: Qu Wenruo To: linux-btrfs@vger.kernel.org Subject: [PATCH RFC] btrfs: introduce qgroup dirty extents threshold mechanism for snapshot drop and inode truncation Date: Thu, 19 Nov 2020 15:28:28 +0800 Message-Id: <20201119072828.70909-1-wqu@suse.com> X-Mailer: git-send-email 2.29.2 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org When dropping a lot of btrfs subvolumes with qgroup enabled, there can be a pretty large latency for btrfs_commit_transaction(). The reason is, dropping subvolumes/snapshots will create a lot of extent owner change, thus qgroup have to traces such owner changes and cause latency in btrfs_commit_transaction(). For snapshot/subvolume drop, we don't really have any good way to reduce the number of dirty qgroup extents. But least we can still reduce the latency of each btrfs_commit_transaction() run, by trying to commit transaction when the dirty qgroup extent number reaches a certain threshold. By this, we can commit several small transactions instead of a big and slow transaction. This patch will introduce the following things: - The ability to trace how many dirty qgroup extents for one transaction A new member, atomic64_t nr_dirty_extents, is introduced to btrfs_delayed_ref_root. - Introduce btrfs_should_commit_trans() helper Now btrfs_should_end_transaction() will also call btrfs_should_commit_trans() before returning. - Commit transaction for subvolume drop if we hits the threshold - Commit transaction for inode truncation if we hits the threshold There is some quick benchmarking for it. The fs is created by the following script: for (( j = 0; j < 16; j++ )); do btrfs subv create $mnt/src/subvol_$j for (( i = 0; i < 512; i++)) ; do xfs_io -f -c "pwrite 0 2k" $mnt/src/subvol_$j/file_inline_$i > /dev/null xfs_io -f -c "pwrite 0 4k" $mnt/src/subvol_$j/file_reg_$i > /dev/null done done sync btrfs quota enable $mnt btrfs quota rescan -w $mnt btrfs sub delete $mnt/src/subvol* I tried several threshold value, the execution time for btrfs_qgroup_account_extents() are: Threshold | Number of calls | Average execution time ------------------------------------------------------------------------ infinite | 1 | 770.74ms 8K | 3 | 280.47ms 4K | 5 | 146.41ms 2K | 9 | 72.36ms 1K | 18 | 35.97ms Currently I choose the 4K as the threshold for its minimal impact on the number of new transactions to be committed, while still keep the latency more or less acceptable. There is another hidden pitfall, if all these extents are mostly shared between different snapshots, current snapshot/subvolume dropping mechanism (breadth-first search) makes the lower level leaves to trigger tons of backref walk, while the higher level tree blocks will only trigger less and less work load. Thus this enhancement won't be that obvious to drop such mostly shared snapshots. To address that, we need to rework how we drop snapshots/subvolumes, and it would definitely be another story. Signed-off-by: Qu Wenruo --- Reason for RFC: - The threshold value Any immediate number is not flex enough, but I don't have better ideas on how to set the value without introducing extra and complex on-disk format change. This threshold doesn't deserve that large change on on-disk format, nor even a mount option. --- fs/btrfs/delayed-ref.h | 20 ++++++++++++++++++++ fs/btrfs/extent-tree.c | 10 +++++++++- fs/btrfs/qgroup.c | 9 ++++++--- fs/btrfs/transaction.c | 3 +++ fs/btrfs/transaction.h | 10 ++++++++++ 5 files changed, 48 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 1c977e6d45dc..aa516b6a8fa1 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -14,6 +14,23 @@ #define BTRFS_ADD_DELAYED_EXTENT 3 /* record a full extent allocation */ #define BTRFS_UPDATE_DELAYED_HEAD 4 /* not changing ref count on head ref */ +/* + * Soft threshold to commit transaction. + * + * Since qgroup has to do its accounting at commit transaction time, it can + * greatly impact system performance. + * Thus for use cases like subvolume drop, we need to throttle the number of + * dirty extents in one transaction. + * + * The fixed 4K number here means we can handle 2K tree block CoWs (32MiB + * for 16K nodesize). Or 16M (4K data extemt size) ~ 512G (128M extent size). + * And during test, 4K dirty qgroup extents would make + * btrfs_qgroup_account_extents() to take around 150ms to excute. + * + * This 4K is kinda OK for the balanced latency and extent amount. + */ +#define BTRFS_QGROUP_DIRTY_EXTENTS_THRESHOLD (SZ_4K) + struct btrfs_delayed_ref_node { struct rb_node ref_node; /* @@ -174,6 +191,9 @@ struct btrfs_delayed_ref_root { * the snapshot in new_root/old_roots or it will get calculated twice */ u64 qgroup_to_skip; + + /* Record how many dirty extents we have in dirty_extent_root */ + atomic64_t nr_dirty_extents; }; enum btrfs_ref_type { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 56ea380f5a17..5a7d7c02c1db 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5539,7 +5539,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc) goto out_end_trans; } - btrfs_end_transaction_throttle(trans); + if (btrfs_should_commit_trans(trans)) { + ret = btrfs_commit_transaction(trans); + if (ret < 0) { + err = ret; + goto out_end_trans; + } + } else { + btrfs_end_transaction_throttle(trans); + } if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) { btrfs_debug(fs_info, "drop snapshot early exit"); diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index bf4b02a40ecc..cff484bf4f80 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1642,6 +1642,7 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info, rb_link_node(&record->node, parent_node, p); rb_insert_color(&record->node, &delayed_refs->dirty_extent_root); + atomic64_inc(&delayed_refs->nr_dirty_extents); return 0; } @@ -2548,17 +2549,17 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans) struct btrfs_delayed_ref_root *delayed_refs; struct ulist *new_roots = NULL; struct rb_node *node; - u64 num_dirty_extents = 0; + u64 num_dirty_extents; u64 qgroup_to_skip; int ret = 0; delayed_refs = &trans->transaction->delayed_refs; + num_dirty_extents = atomic64_read(&delayed_refs->nr_dirty_extents); qgroup_to_skip = delayed_refs->qgroup_to_skip; while ((node = rb_first(&delayed_refs->dirty_extent_root))) { record = rb_entry(node, struct btrfs_qgroup_extent_record, node); - num_dirty_extents++; trace_btrfs_qgroup_account_extents(fs_info, record); if (!ret) { @@ -2607,10 +2608,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans) new_roots = NULL; rb_erase(node, &delayed_refs->dirty_extent_root); kfree(record); - + atomic64_dec(&delayed_refs->nr_dirty_extents); } trace_qgroup_num_dirty_extents(fs_info, trans->transid, num_dirty_extents); + ASSERT(!atomic64_read(&delayed_refs->nr_dirty_extents)); return ret; } @@ -4179,4 +4181,5 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans) ulist_free(entry->old_roots); kfree(entry); } + atomic64_set(&trans->delayed_refs.nr_dirty_extents, 0); } diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index b671ea4d80e1..4ebbf16c3113 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -354,6 +354,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info, cur_trans->delayed_refs.href_root = RB_ROOT_CACHED; cur_trans->delayed_refs.dirty_extent_root = RB_ROOT; atomic_set(&cur_trans->delayed_refs.num_entries, 0); + atomic64_set(&cur_trans->delayed_refs.nr_dirty_extents, 0); /* * although the tree mod log is per file system and not per transaction, @@ -912,6 +913,8 @@ int btrfs_should_end_transaction(struct btrfs_trans_handle *trans) { struct btrfs_transaction *cur_trans = trans->transaction; + if (btrfs_should_commit_trans(trans)) + return 1; smp_mb(); if (cur_trans->state >= TRANS_STATE_COMMIT_START || cur_trans->delayed_refs.flushing) diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 8241c050ba71..e9857182333d 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -197,6 +197,16 @@ static inline void btrfs_clear_skip_qgroup(struct btrfs_trans_handle *trans) delayed_refs->qgroup_to_skip = 0; } +static inline bool btrfs_should_commit_trans(struct btrfs_trans_handle *trans) +{ + struct btrfs_transaction *cur_trans = trans->transaction; + + if (atomic64_read(&cur_trans->delayed_refs.nr_dirty_extents) >= + BTRFS_QGROUP_DIRTY_EXTENTS_THRESHOLD) + return true; + return false; +} + int btrfs_end_transaction(struct btrfs_trans_handle *trans); struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root, unsigned int num_items);