
[02/10] btrfs: make hole and data seeking a lot more efficient

Message ID 246cd5358b28e3e11a96fe2abd0a4a34840cdb85.1662022922.git.fdmanana@suse.com (mailing list archive)
State New, archived
Series btrfs: make lseek and fiemap much more efficient

Commit Message

Filipe Manana Sept. 1, 2022, 1:18 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

The current implementation of hole and data seeking for llseek does not
scale well in regards to the number of extents and the distance between
the start offset and the next hole or extent. This is due to a very high
algorithmic complexity. Often we also get reports of btrfs' hole and data
seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
tag at the bottom).
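
For context, this is the interface whose kernel side is being optimized: a
user space program asking for the next hole or data region with lseek().
The following is only an illustration and not part of the patch (the file
name is an arbitrary example):

    $ cat seek-example.c
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        off_t hole, data;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Offset of the first hole at or after offset 0. */
        hole = lseek(fd, 0, SEEK_HOLE);
        if (hole < 0)
            perror("lseek SEEK_HOLE");
        else
            printf("first hole at offset %lld\n", (long long)hole);

        /* Offset of the first data region at or after offset 0. */
        data = lseek(fd, 0, SEEK_DATA);
        if (data < 0)
            perror("lseek SEEK_DATA");
        else
            printf("first data at offset %lld\n", (long long)data);

        close(fd);
        return 0;
    }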

In order to better understand it, let's consider the case where the start
offset is 0, we are seeking for a hole and the file size is 16G. Between
file offset 0 and the first hole in the file there are 100K extents - this
is common for large files, especially if we have compression enabled, since
the maximum extent size is limited to 128K. The steps taken by the main
loop of the current algorithm are the following:

1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
   calls btrfs_get_extent(). This will first lookup for an extent map in
   the inode's extent map tree (a red black tree). If the extent map is
   not loaded in memory, then it will do a lookup for the corresponding
   file extent item in the subvolume's b+tree, create an extent map based
   on the contents of the file extent item and then add the extent map to
   the extent map tree of the inode;

2) The second iteration calls btrfs_get_extent_fiemap() again, this time
   with a start offset matching the end offset of the previous extent.
   Again, btrfs_get_extent() will first search the extent map tree, and
   if it doesn't find an extent map there, it will again search in the
   b+tree of the subvolume for a matching file extent item, build an
   extent map based on the file extent item, and add the extent map to
   the extent map tree of the inode;

3) This repeats over and over until we find the first hole (when seeking
   for holes) or until we find the first extent (when seeking for data).

   If there are no extent maps loaded in memory, then on
   each iteration we do 1 extent map tree search, 1 b+tree search, plus
   1 more extent map tree traversal to insert an extent map - plus we
   allocate memory for the extent map.

   On each iteration we are growing the size of the extent map tree,
   making each future search slower, and also visiting the same b+tree
   leaves over and over again - taking into account that with the default leaf
   size of 16K we can fit more than 200 file extent items in a leaf - so
   we can visit the same b+tree leaf 200+ times, on each visit walking
   down a path from the root to the leaf.

So it's easy to see that what we have now doesn't scale well. Also, it
loads an extent map for every file extent item into memory, which is not
efficient - we should add extent maps only when doing IO (writing or
reading file data).

This change implements a new algorithm which scales much better, and
works like this:

1) We iterate over the subvolume's b+tree, visiting each leaf that has
   file extent items once and only once;

2) For any file extent items found, that don't represent holes or prealloc
   extents, it will not search the extent map tree - there's no need at
   all for that - an extent map is just an in-memory representation of a
   file extent item;

3) When a hole is found, or a prealloc extent, it will check if there's
   delalloc for its range. For this it will search for EXTENT_DELALLOC
   bits in the inode's io tree and check the extent map tree - this is
   for accounting for unflushed delalloc and for flushed delalloc (the
   period between running delalloc and ordered extent completion),
   respectively. This is similar to what the current implementation does
   when it finds a hole or prealloc extent, but without creating extent
   maps and adding them to the extent map tree in case they are not
   loaded in memory;

4) It never allocates extent maps, or adds extent maps to the inode's
   extent map tree. This not only saves memory and time (from the tree
   insertions and allocations), but also eliminates the possibility of
   -ENOMEM due to allocating too many extent maps.

Part of this new code will also be used later for fiemap (which also
suffers similar scalability problems).
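
To make the complexity difference concrete, here is a toy user space model
(it is not kernel code and not part of the patch): a sorted array stands in
for the file extent items in the subvolume's b+tree, the "old" function does
one separate lookup per extent, similar to what the btrfs_get_extent_fiemap()
loop does (the toy even ignores the extent map allocations and insertions,
which only make the real thing worse), while the "new" function walks the
ordered items a single time:

    $ cat seek-hole-model.c
    #include <stdio.h>
    #include <stdlib.h>

    struct extent {
        unsigned long long start;
        unsigned long long end;    /* non-inclusive */
    };

    static unsigned long long nr_lookups;

    /* One lookup here stands in for one root-to-leaf b+tree descent. */
    static const struct extent *lookup(const struct extent *ext, size_t n,
                                       unsigned long long offset)
    {
        size_t lo = 0, hi = n;

        nr_lookups++;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (offset < ext[mid].start)
                hi = mid;
            else if (offset >= ext[mid].end)
                lo = mid + 1;
            else
                return &ext[mid];
        }
        return NULL;
    }

    /* Old style: one lookup per extent until we fall into a hole. */
    static unsigned long long seek_hole_old(const struct extent *ext, size_t n)
    {
        unsigned long long offset = 0;
        const struct extent *e;

        while ((e = lookup(ext, n, offset)) != NULL)
            offset = e->end;
        return offset;
    }

    /* New style: a single ordered walk, stop at the first gap. */
    static unsigned long long seek_hole_new(const struct extent *ext, size_t n)
    {
        unsigned long long offset = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            if (ext[i].start > offset)
                break;
            offset = ext[i].end;
        }
        return offset;
    }

    int main(void)
    {
        const size_t nr_extents = 100000;  /* ~100K extents, as in the example */
        struct extent *ext = calloc(nr_extents, sizeof(*ext));
        unsigned long long hole;
        size_t i;

        if (!ext)
            return 1;
        /* Contiguous 128K extents, so the first hole is right after them. */
        for (i = 0; i < nr_extents; i++) {
            ext[i].start = (unsigned long long)i * 131072;
            ext[i].end = ext[i].start + 131072;
        }

        hole = seek_hole_new(ext, nr_extents);
        printf("new: hole at %llu after one ordered pass\n", hole);

        hole = seek_hole_old(ext, nr_extents);
        printf("old: hole at %llu after %llu separate lookups\n", hole, nr_lookups);

        free(ext);
        return 0;
    }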

The following test example can be used to quickly measure the efficiency
before and after this patch:

    $ cat test-seek-hole.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV

    mount -o compress=lzo $DEV $MNT

    # 16G file -> 131073 compressed extents.
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar

    # Leave a 1M hole at file offset 15G.
    xfs_io -c "fpunch 15G 1M" $MNT/foobar

    # Unmount and mount again, so that we can test when there's no
    # metadata cached in memory.
    umount $MNT
    mount -o compress=lzo $DEV $MNT

    # Test seeking for hole from offset 0 (hole is at offset 15G).

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata not cached)"
    echo

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata cached)"
    echo

    umount $MNT

Before this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 176 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 17 milliseconds to seek first hole (metadata cached)

After this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 43 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 13 milliseconds to seek first hole (metadata cached)

That's about 4X faster when no metadata is cached and about 30% faster
when all metadata is cached.

In practice the differences may often be significantly higher, either due
to a higher number of extents in a file or because the subvolume's b+tree
is much bigger than in this example, where we only have one file.

Link: https://lwn.net/Articles/718805/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 406 insertions(+), 31 deletions(-)

Comments

Josef Bacik Sept. 1, 2022, 2:03 p.m. UTC | #1
On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
> 
> In order to better understand it, let's consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, especially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps taken by the main
> loop of the current algorithm are the following:
> 
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>    calls btrfs_get_extent(). This will first lookup for an extent map in
>    the inode's extent map tree (a red black tree). If the extent map is
>    not loaded in memory, then it will do a lookup for the corresponding
>    file extent item in the subvolume's b+tree, create an extent map based
>    on the contents of the file extent item and then add the extent map to
>    the extent map tree of the inode;
> 
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>    with a start offset matching the end offset of the previous extent.
>    Again, btrfs_get_extent() will first search the extent map tree, and
>    if it doesn't find an extent map there, it will again search in the
>    b+tree of the subvolume for a matching file extent item, build an
>    extent map based on the file extent item, and add the extent map to
>    the extent map tree of the inode;
> 
> 3) This repeats over and over until we find the first hole (when seeking
>    for holes) or until we find the first extent (when seeking for data).
> 
>    If there are no extent maps loaded in memory, then on
>    each iteration we do 1 extent map tree search, 1 b+tree search, plus
>    1 more extent map tree traversal to insert an extent map - plus we
>    allocate memory for the extent map.
> 
>    On each iteration we are growing the size of the extent map tree,
>    making each future search slower, and also visiting the same b+tree
>    leaves over and over again - taking into account that with the default leaf
>    size of 16K we can fit more than 200 file extent items in a leaf - so
>    we can visit the same b+tree leaf 200+ times, on each visit walking
>    down a path from the root to the leaf.
> 
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
> 
> This change implements a new algorithm which scales much better, and
> works like this:
> 
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>    file extent items once and only once;
> 
> 2) For any file extent items found, that don't represent holes or prealloc
>    extents, it will not search the extent map tree - there's no need at
>    all for that - an extent map is just an in-memory representation of a
>    file extent item;
> 
> 3) When a hole is found, or a prealloc extent, it will check if there's
>    delalloc for its range. For this it will search for EXTENT_DELALLOC
>    bits in the inode's io tree and check the extent map tree - this is
>    for accounting for unflushed delalloc and for flushed delalloc (the
>    period between running delalloc and ordered extent completion),
>    respectively. This is similar to what the current implementation does
>    when it finds a hole or prealloc extent, but without creating extent
>    maps and adding them to the extent map tree in case they are not
>    loaded in memory;
> 
> 4) It never allocates extent maps, or adds extent maps to the inode's
>    extent map tree. This not only saves memory and time (from the tree
>    insertions and allocations), but also eliminates the possibility of
>    -ENOMEM due to allocating too many extent maps.
> 
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
> 
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
> 
>     $ cat test-seek-hole.sh
>     #!/bin/bash
> 
>     DEV=/dev/sdi
>     MNT=/mnt/sdi
> 
>     mkfs.btrfs -f $DEV
> 
>     mount -o compress=lzo $DEV $MNT
> 
>     # 16G file -> 131073 compressed extents.
>     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> 
>     # Leave a 1M hole at file offset 15G.
>     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> 
>     # Unmount and mount again, so that we can test when there's no
>     # metadata cached in memory.
>     umount $MNT
>     mount -o compress=lzo $DEV $MNT
> 
>     # Test seeking for hole from offset 0 (hole is at offset 15G).
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>     echo
> 
>     start=$(date +%s%N)
>     xfs_io -c "seek -h 0" $MNT/foobar
>     end=$(date +%s%N)
>     dur=$(( (end - start) / 1000000 ))
>     echo "Took $dur milliseconds to seek first hole (metadata cached)"
>     echo
> 
>     umount $MNT
> 
> Before this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 176 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 17 milliseconds to seek first hole (metadata cached)
> 
> After this change:
> 
>     $ ./test-seek-hole.sh
>     (...)
>     Whence	Result
>     HOLE	16106127360
>     Took 43 milliseconds to seek first hole (metadata not cached)
> 
>     Whence	Result
>     HOLE	16106127360
>     Took 13 milliseconds to seek first hole (metadata cached)
> 
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
> 
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
> 
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 406 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	return ret;
>  }
>  
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +

You can move this after the lookup, and then remove the if (em && parts above.
Then all you need to do is in the second if statement return (delalloc_len > 0);
Thanks,

Josef
Filipe Manana Sept. 1, 2022, 3 p.m. UTC | #2
On Thu, Sep 01, 2022 at 10:03:03AM -0400, Josef Bacik wrote:
> On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> > 
> > In order to better understand it, let's consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, especially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps taken by the main
> > loop of the current algorithm are the following:
> > 
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >    calls btrfs_get_extent(). This will first lookup for an extent map in
> >    the inode's extent map tree (a red black tree). If the extent map is
> >    not loaded in memory, then it will do a lookup for the corresponding
> >    file extent item in the subvolume's b+tree, create an extent map based
> >    on the contents of the file extent item and then add the extent map to
> >    the extent map tree of the inode;
> > 
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >    with a start offset matching the end offset of the previous extent.
> >    Again, btrfs_get_extent() will first search the extent map tree, and
> >    if it doesn't find an extent map there, it will again search in the
> >    b+tree of the subvolume for a matching file extent item, build an
> >    extent map based on the file extent item, and add the extent map to
> >    the extent map tree of the inode;
> > 
> > 3) This repeats over and over until we find the first hole (when seeking
> >    for holes) or until we find the first extent (when seeking for data).
> > 
> >    If there are no extent maps loaded in memory, then on
> >    each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >    1 more extent map tree traversal to insert an extent map - plus we
> >    allocate memory for the extent map.
> > 
> >    On each iteration we are growing the size of the extent map tree,
> >    making each future search slower, and also visiting the same b+tree
> >    leaves over and over again - taking into account that with the default leaf
> >    size of 16K we can fit more than 200 file extent items in a leaf - so
> >    we can visit the same b+tree leaf 200+ times, on each visit walking
> >    down a path from the root to the leaf.
> > 
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extent maps only when doing IO (writing or
> > reading file data).
> > 
> > This change implements a new algorithm which scales much better, and
> > works like this:
> > 
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >    file extent items once and only once;
> > 
> > 2) For any file extent items found, that don't represent holes or prealloc
> >    extents, it will not search the extent map tree - there's no need at
> >    all for that - an extent map is just an in-memory representation of a
> >    file extent item;
> > 
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >    delalloc for its range. For this it will search for EXTENT_DELALLOC
> >    bits in the inode's io tree and check the extent map tree - this is
> >    for accounting for unflushed delalloc and for flushed delalloc (the
> >    period between running delalloc and ordered extent completion),
> >    respectively. This is similar to what the current implementation does
> >    when it finds a hole or prealloc extent, but without creating extent
> >    maps and adding them to the extent map tree in case they are not
> >    loaded in memory;
> > 
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >    extent map tree. This not only saves memory and time (from the tree
> >    insertions and allocations), but also eliminates the possibility of
> >    -ENOMEM due to allocating too many extent maps.
> > 
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> > 
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> > 
> >     $ cat test-seek-hole.sh
> >     #!/bin/bash
> > 
> >     DEV=/dev/sdi
> >     MNT=/mnt/sdi
> > 
> >     mkfs.btrfs -f $DEV
> > 
> >     mount -o compress=lzo $DEV $MNT
> > 
> >     # 16G file -> 131073 compressed extents.
> >     xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> > 
> >     # Leave a 1M hole at file offset 15G.
> >     xfs_io -c "fpunch 15G 1M" $MNT/foobar
> > 
> >     # Unmount and mount again, so that we can test when there's no
> >     # metadata cached in memory.
> >     umount $MNT
> >     mount -o compress=lzo $DEV $MNT
> > 
> >     # Test seeking for hole from offset 0 (hole is at offset 15G).
> > 
> >     start=$(date +%s%N)
> >     xfs_io -c "seek -h 0" $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >     echo
> > 
> >     start=$(date +%s%N)
> >     xfs_io -c "seek -h 0" $MNT/foobar
> >     end=$(date +%s%N)
> >     dur=$(( (end - start) / 1000000 ))
> >     echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >     echo
> > 
> >     umount $MNT
> > 
> > Before this change:
> > 
> >     $ ./test-seek-hole.sh
> >     (...)
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 176 milliseconds to seek first hole (metadata not cached)
> > 
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 17 milliseconds to seek first hole (metadata cached)
> > 
> > After this change:
> > 
> >     $ ./test-seek-hole.sh
> >     (...)
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 43 milliseconds to seek first hole (metadata not cached)
> > 
> >     Whence	Result
> >     HOLE	16106127360
> >     Took 13 milliseconds to seek first hole (metadata cached)
> > 
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> > 
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
> > 
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 406 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, and merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +	const u64 len = end + 1 - start;
> > +	struct extent_map_tree *em_tree = &inode->extent_tree;
> > +	struct extent_map *em;
> > +	u64 em_end;
> > +	u64 delalloc_len;
> > +
> > +	/*
> > +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +	 * means we have delalloc (dirty pages) for which writeback has not
> > +	 * started yet.
> > +	 */
> > +	*delalloc_start_ret = start;
> > +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +					len, EXTENT_DELALLOC, 1);
> > +	/*
> > +	 * If delalloc was found then *delalloc_start_ret has a sector size
> > +	 * aligned value (rounded down).
> > +	 */
> > +	if (delalloc_len > 0)
> > +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +	/*
> > +	 * Now also check if there's any extent map in the range that does not
> > +	 * map to a hole or prealloc extent. We do this because:
> > +	 *
> > +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> > +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +	 *    an allocated extent. So we might just have been called after
> > +	 *    delalloc is flushed and before the ordered extent completes and
> > +	 *    inserts the new file extent item in the subvolume's btree;
> > +	 *
> > +	 * 2) We may have an extent map created by flushing delalloc for a
> > +	 *    subrange that starts before the subrange we found marked with
> > +	 *    EXTENT_DELALLOC in the io tree.
> > +	 */
> > +	read_lock(&em_tree->lock);
> > +	em = lookup_extent_mapping(em_tree, start, len);
> > +	read_unlock(&em_tree->lock);
> > +
> > +	/* extent_map_end() returns a non-inclusive end offset. */
> > +	em_end = em ? extent_map_end(em) : 0;
> > +
> > +	/*
> > +	 * If we have a hole/prealloc extent map, check the next one if this one
> > +	 * ends before our range's end.
> > +	 */
> > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +		struct extent_map *next_em;
> > +
> > +		read_lock(&em_tree->lock);
> > +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +		read_unlock(&em_tree->lock);
> > +
> > +		free_extent_map(em);
> > +		em_end = next_em ? extent_map_end(next_em) : 0;
> > +		em = next_em;
> > +	}
> > +
> > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +		free_extent_map(em);
> > +		em = NULL;
> > +	}
> > +
> > +	/*
> > +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +	 * range we found in the io tree if we have one.
> > +	 */
> > +	if (!em)
> > +		return (delalloc_len > 0);
> > +
> 
> You can move this after the lookup, and then remove the if (em && parts above.
> Then all you need to do is in the second if statement return (delalloc_len > 0);

Nope, it won't work by doing just that.

To move that if statement would require all the following changes,
which to me don't seem to provide any benefit, aesthetically or
otherwise:

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 636b3ec46184..05037b8950d5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3649,39 +3649,39 @@ static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end
 	em = lookup_extent_mapping(em_tree, start, len);
 	read_unlock(&em_tree->lock);
 
+	if (!em)
+		return (delalloc_len > 0);
+
 	/* extent_map_end() returns a non-inclusive end offset. */
-	em_end = em ? extent_map_end(em) : 0;
+	em_end = extent_map_end(em);
 
 	/*
 	 * If we have a hole/prealloc extent map, check the next one if this one
 	 * ends before our range's end.
 	 */
-	if (em && (em->block_start == EXTENT_MAP_HOLE ||
-		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
+	if (em->block_start == EXTENT_MAP_HOLE ||
+	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
 		struct extent_map *next_em;
 
+		free_extent_map(em);
+
+		if (em_end >= end)
+			return (delalloc_len > 0);
+
 		read_lock(&em_tree->lock);
 		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
 		read_unlock(&em_tree->lock);
 
-		free_extent_map(em);
-		em_end = next_em ? extent_map_end(next_em) : 0;
-		em = next_em;
-	}
+		if (!next_em || next_em->block_start == EXTENT_MAP_HOLE ||
+		    test_bit(EXTENT_FLAG_PREALLOC, &next_em->flags)) {
+			free_extent_map(next_em);
+			return (delalloc_len > 0);
+		}
 
-	if (em && (em->block_start == EXTENT_MAP_HOLE ||
-		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-		free_extent_map(em);
-		em = NULL;
+		em = next_em;
+		em_end = extent_map_end(em);
 	}
 
-	/*
-	 * No extent map or one for a hole or prealloc extent. Use the delalloc
-	 * range we found in the io tree if we have one.
-	 */
-	if (!em)
-		return (delalloc_len > 0);
-
 	/*
 	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
 	 * extent map is the only subrange representing delalloc.


> Thanks,
> 
> Josef
Qu Wenruo Sept. 1, 2022, 10:18 p.m. UTC | #3
On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
>
> In order to better understand it, let's consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, especially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps taken by the main
> loop of the current algorithm are the following:
>
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>     calls btrfs_get_extent(). This will first lookup for an extent map in
>     the inode's extent map tree (a red black tree). If the extent map is
>     not loaded in memory, then it will do a lookup for the corresponding
>     file extent item in the subvolume's b+tree, create an extent map based
>     on the contents of the file extent item and then add the extent map to
>     the extent map tree of the inode;
>
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>     with a start offset matching the end offset of the previous extent.
>     Again, btrfs_get_extent() will first search the extent map tree, and
>     if it doesn't find an extent map there, it will again search in the
>     b+tree of the subvolume for a matching file extent item, build an
>     extent map based on the file extent item, and add the extent map to
>     the extent map tree of the inode;
>
> 3) This repeats over and over until we find the first hole (when seeking
>     for holes) or until we find the first extent (when seeking for data).
>
>     If there are no extent maps loaded in memory, then on
>     each iteration we do 1 extent map tree search, 1 b+tree search, plus
>     1 more extent map tree traversal to insert an extent map - plus we
>     allocate memory for the extent map.

I'm a little interested in whether we have other workloads which rely
heavily on extent map tree searches?

If so, would it make sense to load a batch of file extent items into the
extent map tree in one go?


Another thing, not that related to the patchset: since extent maps don't
really get freed up unless the inode is evicted/truncated, I'm wondering
whether it would be a problem for heavily fragmented files to take up too
much memory just for the extent map tree?

Would we need a way to drop extent maps in the future?

>
>     On each iteration we are growing the size of the extent map tree,
>     making each future search slower, and also visiting the same b+tree
>     leaves over and over again - taking into account that with the default leaf
>     size of 16K we can fit more than 200 file extent items in a leaf - so
>     we can visit the same b+tree leaf 200+ times, on each visit walking
>     down a path from the root to the leaf.
>
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extent maps only when doing IO (writing or
> reading file data).
>
> This change implements a new algorithm which scales much better, and
> works like this:
>
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>     file extent items once and only once;
>
> 2) For any file extent items found, that don't represent holes or prealloc
>     extents, it will not search the extent map tree - there's no need at
>     all for that - an extent map is just an in-memory representation of a
>     file extent item;
>
> 3) When a hole is found, or a prealloc extent, it will check if there's
>     delalloc for its range. For this it will search for EXTENT_DELALLOC
>     bits in the inode's io tree and check the extent map tree - this is
>     for accounting for unflushed delalloc and for flushed delalloc (the
>     period between running delalloc and ordered extent completion),
>     respectively. This is similar to what the current implementation does
>     when it finds a hole or prealloc extent, but without creating extent
>     maps and adding them to the extent map tree in case they are not
>     loaded in memory;

Would it be possible, before starting the subvolume tree search, to just
run all delalloc of the target inode and prevent new writes, so we can
forget about the delalloc situation completely?

>
> 4) It never allocates extent maps, or adds extent maps to the inode's
>     extent map tree. This not only saves memory and time (from the tree
>     insertions and allocations), but also eliminates the possibility of
>     -ENOMEM due to allocating too many extent maps.
>
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
>
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
>
>      $ cat test-seek-hole.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>
>      mount -o compress=lzo $DEV $MNT
>
>      # 16G file -> 131073 compressed extents.
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
>
>      # Leave a 1M hole at file offset 15G.
>      xfs_io -c "fpunch 15G 1M" $MNT/foobar
>
>      # Unmount and mount again, so that we can test when there's no
>      # metadata cached in memory.
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      # Test seeking for hole from offset 0 (hole is at offset 15G).
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>      echo
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata cached)"
>      echo
>
>      umount $MNT
>
> Before this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 176 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 17 milliseconds to seek first hole (metadata cached)
>
> After this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 43 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 13 milliseconds to seek first hole (metadata cached)
>
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
>
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 406 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	return ret;
>   }
>
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +
> +	/*
> +	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> +	 * extent map is the only subrange representing delalloc.
> +	 */
> +	if (delalloc_len == 0) {
> +		*delalloc_start_ret = em->start;
> +		*delalloc_end_ret = min(end, em_end - 1);
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map represents a delalloc range that starts before the
> +	 * delalloc range we found in the io tree.
> +	 */
> +	if (em->start < *delalloc_start_ret) {
> +		*delalloc_start_ret = em->start;
> +		/*
> +		 * If the ranges are adjacent, return a combined range.
> +		 * Otherwise return the extent map's range.
> +		 */
> +		if (em_end < *delalloc_start_ret)
> +			*delalloc_end_ret = min(end, em_end - 1);
> +
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map starts after the delalloc range we found in the io
> +	 * tree. If it's adjacent, return a combined range, otherwise return
> +	 * the range found in the io tree.
> +	 */
> +	if (*delalloc_end_ret + 1 == em->start)
> +		*delalloc_end_ret = min(end, em_end - 1);
> +
> +	free_extent_map(em);
> +	return true;
> +}
> +
> +/*
> + * Check if there's delalloc in a given range.
> + *
> + * @inode:               The inode.
> + * @start:               The start offset of the range. It does not need to be
> + *                       sector size aligned.
> + * @end:                 The end offset (inclusive value) of the search range.
> + *                       It does not need to be sector size aligned.
> + * @delalloc_start_ret:  Output argument, set to the start offset of the
> + *                       subrange found with delalloc (may not be sector size
> + *                       aligned).
> + * @delalloc_end_ret:    Output argument, set to he end offset (inclusive value)
> + *                       of the subrange found with delalloc.
> + *
> + * Returns true if a subrange with delalloc is found within the given range, and
> + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> + * end offsets of the subrange.
> + */
> +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> +	u64 prev_delalloc_end = 0;
> +	bool ret = false;
> +
> +	while (cur_offset < end) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		bool delalloc;
> +
> +		delalloc = find_delalloc_subrange(inode, cur_offset, end,
> +						  &delalloc_start,
> +						  &delalloc_end);
> +		if (!delalloc)
> +			break;
> +
> +		if (prev_delalloc_end == 0) {
> +			/* First subrange found. */
> +			*delalloc_start_ret = max(delalloc_start, start);
> +			*delalloc_end_ret = delalloc_end;
> +			ret = true;
> +		} else if (delalloc_start == prev_delalloc_end + 1) {
> +			/* Subrange adjacent to the previous one, merge them. */
> +			*delalloc_end_ret = delalloc_end;
> +		} else {
> +			/* Subrange not adjacent to the previous one, exit. */
> +			break;
> +		}
> +
> +		prev_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Check if there's a hole or delalloc range in a range representing a hole (or
> + * prealloc extent) found in the inode's subvolume btree.
> + *
> + * @inode:      The inode.
> + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> + * @start:      Start offset of the hole region. It does not need to be sector
> + *              size aligned.
> + * @end:        End offset (inclusive value) of the hole region. It does not
> + *              need to be sector size aligned.
> + * @start_ret:  Return parameter, used to set the start of the subrange in the
> + *              hole that matches the search criteria (seek mode), if such
> + *              subrange is found (return value of the function is true).
> + *              The value returned here may not be sector size aligned.
> + *
> + * Returns true if a subrange matching the given seek mode is found, and if one
> + * is found, it updates @start_ret with the start of the subrange.
> + */
> +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> +					u64 start, u64 end, u64 *start_ret)
> +{
> +	u64 delalloc_start;
> +	u64 delalloc_end;
> +	bool delalloc;
> +
> +	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> +					  &delalloc_end);
> +	if (delalloc && whence == SEEK_DATA) {
> +		*start_ret = delalloc_start;
> +		return true;
> +	}
> +
> +	if (delalloc && whence == SEEK_HOLE) {
> +		/*
> +		 * We found delalloc but it starts after out start offset. So we
> +		 * have a hole between our start offset and the delalloc start.
> +		 */
> +		if (start < delalloc_start) {
> +			*start_ret = start;
> +			return true;
> +		}
> +		/*
> +		 * Delalloc range starts at our start offset.
> +		 * If the delalloc range's length is smaller than our range,
> +		 * then it means we have a hole that starts where the delalloc
> +		 * subrange ends.
> +		 */
> +		if (delalloc_end < end) {
> +			*start_ret = delalloc_end + 1;
> +			return true;
> +		}
> +
> +		/* There's delalloc for the whole range. */
> +		return false;
> +	}
> +
> +	if (!delalloc && whence == SEEK_HOLE) {
> +		*start_ret = start;
> +		return true;
> +	}
> +
> +	/*
> +	 * No delalloc in the range and we are seeking for data. The caller has
> +	 * to iterate to the next extent item in the subvolume btree.
> +	 */
> +	return false;
> +}
> +
>   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   				  int whence)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map *em = NULL;
>   	struct extent_state *cached_state = NULL;
> -	loff_t i_size = inode->vfs_inode.i_size;
> +	const loff_t i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	u64 last_extent_end;
>   	u64 lockstart;
>   	u64 lockend;
>   	u64 start;
> -	u64 len;
> -	int ret = 0;
> +	int ret;
> +	bool found = false;
>
>   	if (i_size == 0 || offset >= i_size)
>   		return -ENXIO;
>
> +	/*
> +	 * Quick path. If the inode has no prealloc extents and its number of
> +	 * bytes used matches its i_size, then it can not have holes.
> +	 */
> +	if (whence == SEEK_HOLE &&
> +	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
> +	    inode_get_bytes(&inode->vfs_inode) == i_size)
> +		return i_size;
> +

Would we need a counterpart quick path for the all-holes case?

Thanks,
Qu

>   	/*
>   	 * offset can be negative, in this case we start finding DATA/HOLE from
>   	 * the very start of the file.
> @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   	if (lockend <= lockstart)
>   		lockend = lockstart + fs_info->sectorsize;
>   	lockend--;
> -	len = lockend - lockstart + 1;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +	path->reada = READA_FORWARD;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = start;
> +
> +	last_extent_end = lockstart;
>
>   	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
>   	while (start < i_size) {
> -		em = btrfs_get_extent_fiemap(inode, start, len);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			em = NULL;
> -			break;
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *extent;
> +		u64 extent_end;
> +
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0)
> +				goto out;
> +			else if (ret > 0)
> +				break;
> +
> +			leaf = path->nodes[0];
>   		}
>
> -		if (whence == SEEK_HOLE &&
> -		    (em->block_start == EXTENT_MAP_HOLE ||
> -		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> -			break;
> -		else if (whence == SEEK_DATA &&
> -			   (em->block_start != EXTENT_MAP_HOLE &&
> -			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>   			break;
>
> -		start = em->start + em->len;
> -		free_extent_map(em);
> -		em = NULL;
> +		extent_end = btrfs_file_extent_end(path);
> +
> +		/*
> +		 * In the first iteration we may have a slot that points to an
> +		 * extent that ends before our start offset, so skip it.
> +		 */
> +		if (extent_end <= start) {
> +			path->slots[0]++;
> +			continue;
> +		}
> +
> +		/* We have an implicit hole, NO_HOLES feature is likely set. */
> +		if (last_extent_end < key.offset) {
> +			u64 search_start = last_extent_end;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    key.offset - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the extent.
> +			 */
> +		}
> +
> +		extent = btrfs_item_ptr(leaf, path->slots[0],
> +					struct btrfs_file_extent_item);
> +
> +		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> +		    btrfs_file_extent_type(leaf, extent) ==
> +		    BTRFS_FILE_EXTENT_PREALLOC) {
> +			/*
> +			 * Explicit hole or prealloc extent, search for delalloc.
> +			 * A prealloc extent is treated like a hole.
> +			 */
> +			u64 search_start = key.offset;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    extent_end - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the next
> +			 * extent item.
> +			 */
> +		} else {
> +			/*
> +			 * Found a regular or inline extent.
> +			 * If we are seeking for data, adjust the start offset
> +			 * and stop, we're done.
> +			 */
> +			if (whence == SEEK_DATA) {
> +				start = max_t(u64, key.offset, offset);
> +				found = true;
> +				break;
> +			}
> +			/*
> +			 * Else, we are seeking for a hole, check the next file
> +			 * extent item.
> +			 */
> +		}
> +
> +		start = extent_end;
> +		last_extent_end = extent_end;
> +		path->slots[0]++;
>   		if (fatal_signal_pending(current)) {
>   			ret = -EINTR;
> -			break;
> +			goto out;
>   		}
>   		cond_resched();
>   	}
> -	free_extent_map(em);
> +
> +	/* We have an implicit hole from the last extent found up to i_size. */
> +	if (!found && start < i_size) {
> +		found = find_desired_extent_in_hole(inode, whence, start,
> +						    i_size - 1, &start);
> +		if (!found)
> +			start = i_size;
> +	}
> +
> +out:
>   	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
>   			     &cached_state);
> -	if (ret) {
> -		offset = ret;
> -	} else {
> -		if (whence == SEEK_DATA && start >= i_size)
> -			offset = -ENXIO;
> -		else
> -			offset = min_t(loff_t, start, i_size);
> -	}
> +	btrfs_free_path(path);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (whence == SEEK_DATA && start >= i_size)
> +		return -ENXIO;
>
> -	return offset;
> +	return min_t(loff_t, start, i_size);
>   }
>
>   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
Filipe Manana Sept. 2, 2022, 8:36 a.m. UTC | #4
On Thu, Sep 1, 2022 at 11:18 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> >
> > In order to better understand it, let's consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, especially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps taken by the main
> > loop of the current algorithm are the following:
> >
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >     calls btrfs_get_extent(). This will first lookup for an extent map in
> >     the inode's extent map tree (a red black tree). If the extent map is
> >     not loaded in memory, then it will do a lookup for the corresponding
> >     file extent item in the subvolume's b+tree, create an extent map based
> >     on the contents of the file extent item and then add the extent map to
> >     the extent map tree of the inode;
> >
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >     with a start offset matching the end offset of the previous extent.
> >     Again, btrfs_get_extent() will first search the extent map tree, and
> >     if it doesn't find an extent map there, it will again search in the
> >     b+tree of the subvolume for a matching file extent item, build an
> >     extent map based on the file extent item, and add the extent map to
> >     the extent map tree of the inode;
> >
> > 3) This repeats over and over until we find the first hole (when seeking
> >     for holes) or until we find the first extent (when seeking for data).
> >
> >     If there are no extent maps loaded in memory, then on
> >     each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >     1 more extent map tree traversal to insert an extent map - plus we
> >     allocate memory for the extent map.
>
> I'm a little intereted in if we have other workload which are heavily
> relying on extent map tree search?
>
> If so, would it make sense to load a batch of file extents items into
> extent map tree in one go?

It depends on the workload...
I don't recall any other case, plus in the case of lseek and fiemap we
don't need to load the extent maps at all.

>
>
> Another thing not that related to the patchset is, since extent map
> doesn't really get freed up unless that inode is invicted/truncated, I'm

It also gets freed on page/folio release.

> wondering would it be a problem for heavily fragmented files to take up
> too much memory just for extent map tree?
>
> Would we need a way to drop extent map in the future?

Extent maps are necessary for an efficient fsync for example.

There's a problem with extent maps I want to address, but that's
unrelated to lseek and fiemap; it's quite a separate issue.

>
> >
> >     On each iteration we are growing the size of the extent map tree,
> >     making each future search slower, and also visiting the same b+tree
> >     leaves over and over again - taking into account that with the default leaf
> >     size of 16K we can fit more than 200 file extent items in a leaf - so
> >     we can visit the same b+tree leaf 200+ times, on each visit walking
> >     down a path from the root to the leaf.
> >
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extent maps only when doing IO (writing or
> > reading file data).
> >
> > This change implements a new algorithm which scales much better, and
> > works like this:
> >
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >     file extent items once and only once;
> >
> > 2) For any file extent items found, that don't represent holes or prealloc
> >     extents, it will not search the extent map tree - there's no need at
> >     all for that - an extent map is just an in-memory representation of a
> >     file extent item;
> >
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >     delalloc for its range. For this it will search for EXTENT_DELALLOC
> >     bits in the inode's io tree and check the extent map tree - this is
> >     for accounting for unflushed delalloc and for flushed delalloc (the
> >     period between running delalloc and ordered extent completion),
> >     respectively. This is similar to what the current implementation does
> >     when it finds a hole or prealloc extent, but without creating extent
> >     maps and adding them to the extent map tree in case they are not
> >     loaded in memory;
>
> Would it be possible, before starting the subvolume tree search, to just
> run all delalloc of the target inode and prevent new writes, so we can
> forget about the delalloc situation completely?

That would be a significant behavioural change.
ext4 and xfs don't seem to do it, and given that lseek is widely used
(by cp, for example), making such a change would likely result in people
reporting extra latency.
I understand your point about reducing/simplifying the code, but from a
user's perspective I don't see a justification to flush delalloc and wait
for IO (and ordered extents) to complete.

>
> >
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >     extent map tree. This not only saves memory and time (from the tree
> >     insertions and allocations), but also eliminates the possibility of
> >     -ENOMEM due to allocating too many extent maps.
> >
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> >
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> >
> >      $ cat test-seek-hole.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 16G file -> 131073 compressed extents.
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> >
> >      # Leave a 1M hole at file offset 15G.
> >      xfs_io -c "fpunch 15G 1M" $MNT/foobar
> >
> >      # Unmount and mount again, so that we can test when there's no
> >      # metadata cached in memory.
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # Test seeking for hole from offset 0 (hole is at offset 15G).
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >      echo
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >      echo
> >
> >      umount $MNT
> >
> > Before this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 176 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 17 milliseconds to seek first hole (metadata cached)
> >
> > After this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 43 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 13 milliseconds to seek first hole (metadata cached)
> >
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> >
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
> >
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 406 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >       return ret;
> >   }
> >
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, and merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     const u64 len = end + 1 - start;
> > +     struct extent_map_tree *em_tree = &inode->extent_tree;
> > +     struct extent_map *em;
> > +     u64 em_end;
> > +     u64 delalloc_len;
> > +
> > +     /*
> > +      * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +      * means we have delalloc (dirty pages) for which writeback has not
> > +      * started yet.
> > +      */
> > +     *delalloc_start_ret = start;
> > +     delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +                                     len, EXTENT_DELALLOC, 1);
> > +     /*
> > +      * If delalloc was found then *delalloc_start_ret has a sector size
> > +      * aligned value (rounded down).
> > +      */
> > +     if (delalloc_len > 0)
> > +             *delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +     /*
> > +      * Now also check if there's any extent map in the range that does not
> > +      * map to a hole or prealloc extent. We do this because:
> > +      *
> > +      * 1) When delalloc is flushed, the file range is locked, we clear the
> > +      *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +      *    an allocated extent. So we might just have been called after
> > +      *    delalloc is flushed and before the ordered extent completes and
> > +      *    inserts the new file extent item in the subvolume's btree;
> > +      *
> > +      * 2) We may have an extent map created by flushing delalloc for a
> > +      *    subrange that starts before the subrange we found marked with
> > +      *    EXTENT_DELALLOC in the io tree.
> > +      */
> > +     read_lock(&em_tree->lock);
> > +     em = lookup_extent_mapping(em_tree, start, len);
> > +     read_unlock(&em_tree->lock);
> > +
> > +     /* extent_map_end() returns a non-inclusive end offset. */
> > +     em_end = em ? extent_map_end(em) : 0;
> > +
> > +     /*
> > +      * If we have a hole/prealloc extent map, check the next one if this one
> > +      * ends before our range's end.
> > +      */
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +             struct extent_map *next_em;
> > +
> > +             read_lock(&em_tree->lock);
> > +             next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +             read_unlock(&em_tree->lock);
> > +
> > +             free_extent_map(em);
> > +             em_end = next_em ? extent_map_end(next_em) : 0;
> > +             em = next_em;
> > +     }
> > +
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +             free_extent_map(em);
> > +             em = NULL;
> > +     }
> > +
> > +     /*
> > +      * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +      * range we found in the io tree if we have one.
> > +      */
> > +     if (!em)
> > +             return (delalloc_len > 0);
> > +
> > +     /*
> > +      * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> > +      * extent map is the only subrange representing delalloc.
> > +      */
> > +     if (delalloc_len == 0) {
> > +             *delalloc_start_ret = em->start;
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map represents a delalloc range that starts before the
> > +      * delalloc range we found in the io tree.
> > +      */
> > +     if (em->start < *delalloc_start_ret) {
> > +             *delalloc_start_ret = em->start;
> > +             /*
> > +              * If the ranges are adjacent, return a combined range.
> > +              * Otherwise return the extent map's range.
> > +              */
> > +             if (em_end < *delalloc_start_ret)
> > +                     *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map starts after the delalloc range we found in the io
> > +      * tree. If it's adjacent, return a combined range, otherwise return
> > +      * the range found in the io tree.
> > +      */
> > +     if (*delalloc_end_ret + 1 == em->start)
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +     free_extent_map(em);
> > +     return true;
> > +}
> > +
> > +/*
> > + * Check if there's delalloc in a given range.
> > + *
> > + * @inode:               The inode.
> > + * @start:               The start offset of the range. It does not need to be
> > + *                       sector size aligned.
> > + * @end:                 The end offset (inclusive value) of the search range.
> > + *                       It does not need to be sector size aligned.
> > + * @delalloc_start_ret:  Output argument, set to the start offset of the
> > + *                       subrange found with delalloc (may not be sector size
> > + *                       aligned).
> > + * @delalloc_end_ret:    Output argument, set to he end offset (inclusive value)
> > + *                       of the subrange found with delalloc.
> > + *
> > + * Returns true if a subrange with delalloc is found within the given range, and
> > + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> > + * end offsets of the subrange.
> > + */
> > +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> > +     u64 prev_delalloc_end = 0;
> > +     bool ret = false;
> > +
> > +     while (cur_offset < end) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             bool delalloc;
> > +
> > +             delalloc = find_delalloc_subrange(inode, cur_offset, end,
> > +                                               &delalloc_start,
> > +                                               &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> > +
> > +             if (prev_delalloc_end == 0) {
> > +                     /* First subrange found. */
> > +                     *delalloc_start_ret = max(delalloc_start, start);
> > +                     *delalloc_end_ret = delalloc_end;
> > +                     ret = true;
> > +             } else if (delalloc_start == prev_delalloc_end + 1) {
> > +                     /* Subrange adjacent to the previous one, merge them. */
> > +                     *delalloc_end_ret = delalloc_end;
> > +             } else {
> > +                     /* Subrange not adjacent to the previous one, exit. */
> > +                     break;
> > +             }
> > +
> > +             prev_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             cond_resched();
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Check if there's a hole or delalloc range in a range representing a hole (or
> > + * prealloc extent) found in the inode's subvolume btree.
> > + *
> > + * @inode:      The inode.
> > + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> > + * @start:      Start offset of the hole region. It does not need to be sector
> > + *              size aligned.
> > + * @end:        End offset (inclusive value) of the hole region. It does not
> > + *              need to be sector size aligned.
> > + * @start_ret:  Return parameter, used to set the start of the subrange in the
> > + *              hole that matches the search criteria (seek mode), if such
> > + *              subrange is found (return value of the function is true).
> > + *              The value returned here may not be sector size aligned.
> > + *
> > + * Returns true if a subrange matching the given seek mode is found, and if one
> > + * is found, it updates @start_ret with the start of the subrange.
> > + */
> > +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> > +                                     u64 start, u64 end, u64 *start_ret)
> > +{
> > +     u64 delalloc_start;
> > +     u64 delalloc_end;
> > +     bool delalloc;
> > +
> > +     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > +                                       &delalloc_end);
> > +     if (delalloc && whence == SEEK_DATA) {
> > +             *start_ret = delalloc_start;
> > +             return true;
> > +     }
> > +
> > +     if (delalloc && whence == SEEK_HOLE) {
> > +             /*
> > +              * We found delalloc but it starts after out start offset. So we
> > +              * have a hole between our start offset and the delalloc start.
> > +              */
> > +             if (start < delalloc_start) {
> > +                     *start_ret = start;
> > +                     return true;
> > +             }
> > +             /*
> > +              * Delalloc range starts at our start offset.
> > +              * If the delalloc range's length is smaller than our range,
> > +              * then it means we have a hole that starts where the delalloc
> > +              * subrange ends.
> > +              */
> > +             if (delalloc_end < end) {
> > +                     *start_ret = delalloc_end + 1;
> > +                     return true;
> > +             }
> > +
> > +             /* There's delalloc for the whole range. */
> > +             return false;
> > +     }
> > +
> > +     if (!delalloc && whence == SEEK_HOLE) {
> > +             *start_ret = start;
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * No delalloc in the range and we are seeking for data. The caller has
> > +      * to iterate to the next extent item in the subvolume btree.
> > +      */
> > +     return false;
> > +}
> > +
> >   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >                                 int whence)
> >   {
> >       struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > -     struct extent_map *em = NULL;
> >       struct extent_state *cached_state = NULL;
> > -     loff_t i_size = inode->vfs_inode.i_size;
> > +     const loff_t i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct btrfs_path *path;
> > +     struct btrfs_key key;
> > +     u64 last_extent_end;
> >       u64 lockstart;
> >       u64 lockend;
> >       u64 start;
> > -     u64 len;
> > -     int ret = 0;
> > +     int ret;
> > +     bool found = false;
> >
> >       if (i_size == 0 || offset >= i_size)
> >               return -ENXIO;
> >
> > +     /*
> > +      * Quick path. If the inode has no prealloc extents and its number of
> > +      * bytes used matches its i_size, then it can not have holes.
> > +      */
> > +     if (whence == SEEK_HOLE &&
> > +         !(inode->flags & BTRFS_INODE_PREALLOC) &&
> > +         inode_get_bytes(&inode->vfs_inode) == i_size)
> > +             return i_size;
> > +
>
> Would we need a counterpart quick path for the all-holes case?

I suppose you mean a file with an i_size > 0 and no extents at all.
That is already fast... the first btrfs_search_slot() will not find
any extent item and we finish very quickly.

Thanks.

>
> Thanks,
> Qu
>
> > (...)
Josef Bacik Sept. 2, 2022, 1:26 p.m. UTC | #5
On Thu, Sep 01, 2022 at 04:00:48PM +0100, Filipe Manana wrote:
> On Thu, Sep 01, 2022 at 10:03:03AM -0400, Josef Bacik wrote:
> > On Thu, Sep 01, 2022 at 02:18:22PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > (...)
> > > 
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 96f444ad0951..b292a8ada3a4 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> > >  	return ret;
> > >  }
> > >  
> > > +/*
> > > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > > + * while it gets adjacent subranges, and merging them together.
> > > + */
> > > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > > +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > > +{
> > > +	const u64 len = end + 1 - start;
> > > +	struct extent_map_tree *em_tree = &inode->extent_tree;
> > > +	struct extent_map *em;
> > > +	u64 em_end;
> > > +	u64 delalloc_len;
> > > +
> > > +	/*
> > > +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > > +	 * means we have delalloc (dirty pages) for which writeback has not
> > > +	 * started yet.
> > > +	 */
> > > +	*delalloc_start_ret = start;
> > > +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > > +					len, EXTENT_DELALLOC, 1);
> > > +	/*
> > > +	 * If delalloc was found then *delalloc_start_ret has a sector size
> > > +	 * aligned value (rounded down).
> > > +	 */
> > > +	if (delalloc_len > 0)
> > > +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > > +
> > > +	/*
> > > +	 * Now also check if there's any extent map in the range that does not
> > > +	 * map to a hole or prealloc extent. We do this because:
> > > +	 *
> > > +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> > > +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > > +	 *    an allocated extent. So we might just have been called after
> > > +	 *    delalloc is flushed and before the ordered extent completes and
> > > +	 *    inserts the new file extent item in the subvolume's btree;
> > > +	 *
> > > +	 * 2) We may have an extent map created by flushing delalloc for a
> > > +	 *    subrange that starts before the subrange we found marked with
> > > +	 *    EXTENT_DELALLOC in the io tree.
> > > +	 */
> > > +	read_lock(&em_tree->lock);
> > > +	em = lookup_extent_mapping(em_tree, start, len);
> > > +	read_unlock(&em_tree->lock);
> > > +
> > > +	/* extent_map_end() returns a non-inclusive end offset. */
> > > +	em_end = em ? extent_map_end(em) : 0;
> > > +
> > > +	/*
> > > +	 * If we have a hole/prealloc extent map, check the next one if this one
> > > +	 * ends before our range's end.
> > > +	 */
> > > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > > +		struct extent_map *next_em;
> > > +
> > > +		read_lock(&em_tree->lock);
> > > +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > > +		read_unlock(&em_tree->lock);
> > > +
> > > +		free_extent_map(em);
> > > +		em_end = next_em ? extent_map_end(next_em) : 0;
> > > +		em = next_em;
> > > +	}
> > > +
> > > +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > > +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > > +		free_extent_map(em);
> > > +		em = NULL;
> > > +	}
> > > +
> > > +	/*
> > > +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> > > +	 * range we found in the io tree if we have one.
> > > +	 */
> > > +	if (!em)
> > > +		return (delalloc_len > 0);
> > > +
> > 
> > You can move this after the lookup, and then remove the "if (em &&" parts above.
> > Then all you need to do in the second if statement is return (delalloc_len > 0);
> 
> Nope, it won't work by doing just that.
> 
> To move that if statement, it would require all the following changes,
> which to me don't seem to provide any benefit, aesthetically or
> otherwise:
> 

Ah yeah I see, in that case you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Qu Wenruo Sept. 11, 2022, 10:12 p.m. UTC | #6
On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The current implementation of hole and data seeking for llseek does not
> scale well in regards to the number of extents and the distance between
> the start offset and the next hole or extent. This is due to a very high
> algorithmic complexity. Often we also get reports of btrfs' hole and data
> seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> tag at the bottom).
>
> In order to better understand it, lets consider the case where the start
> offset is 0, we are seeking for a hole and the file size is 16G. Between
> file offset 0 and the first hole in the file there are 100K extents - this
> is common for large files, specially if we have compression enabled, since
> the maximum extent size is limited to 128K. The steps take by the main
> loop of the current algorithm are the following:
>
> 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
>     calls btrfs_get_extent(). This will first lookup for an extent map in
>     the inode's extent map tree (a red black tree). If the extent map is
>     not loaded in memory, then it will do a lookup for the corresponding
>     file extent item in the subvolume's b+tree, create an extent map based
>     on the contents of the file extent item and then add the extent map to
>     the extent map tree of the inode;
>
> 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
>     with a start offset matching the end offset of the previous extent.
>     Again, btrfs_get_extent() will first search the extent map tree, and
>     if it doesn't find an extent map there, it will again search in the
>     b+tree of the subvolume for a matching file extent item, build an
>     extent map based on the file extent item, and add the extent map to
>     to the extent map tree of the inode;

One small question, unrelated to the whole series since the series
mostly no longer utilizes the extent map tree.

I'm wondering if we should add all nearby extent maps in a batch, rather
than one by one, in btrfs_get_extent()?
(E.g. if we find what we need, and there are more file extent items in
the same leaf, we can just continue and add them all to the cache.)

However I'm not 100% sure whether it's a good idea, as extent maps take
up memory and there is no way to free them except on truncation/eviction.
And we have very limited usage for extent maps (other than reads and the
log tree?)

Thanks,
Qu
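
(For illustration only, a very rough sketch of that batching idea. This is
not a proposed implementation, and extent_map_from_slot() is a made-up
helper name, not an existing btrfs function; it just shows the shape of
filling the cache from the remaining items in the current leaf:)

/*
 * Hypothetical sketch: after btrfs_get_extent() has found the file extent
 * item it was looking for, keep walking the same leaf and opportunistically
 * cache extent maps for the following file extent items of the same inode.
 */
static void batch_cache_leaf_extents(struct btrfs_inode *inode,
				     struct btrfs_path *path)
{
	struct extent_map_tree *em_tree = &inode->extent_tree;
	struct extent_buffer *leaf = path->nodes[0];
	int slot = path->slots[0] + 1;

	for (; slot < btrfs_header_nritems(leaf); slot++) {
		struct btrfs_key key;
		struct extent_map *em;

		btrfs_item_key_to_cpu(leaf, &key, slot);
		if (key.objectid != btrfs_ino(inode) ||
		    key.type != BTRFS_EXTENT_DATA_KEY)
			break;

		/* Made-up helper: build an extent map from the item at @slot. */
		em = extent_map_from_slot(inode, leaf, slot);
		if (IS_ERR(em))
			break;

		write_lock(&em_tree->lock);
		/* Best effort cache fill, ignore -EEXIST and friends. */
		add_extent_mapping(em_tree, em, 0);
		write_unlock(&em_tree->lock);
		free_extent_map(em);
	}
}

The downside, as noted above, is that this grows the extent map tree with
entries that may never be used for IO, which is exactly what this patchset
avoids doing for lseek and fiemap.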

>
> 3) This repeats over and over until we find the first hole (when seeking
>     for holes) or until we find the first extent (when seeking for data).
>
>     If there no extent maps loaded in memory for each iteration, then on
>     each iteration we do 1 extent map tree search, 1 b+tree search, plus
>     1 more extent map tree traversal to insert an extent map - plus we
>     allocate memory for the extent map.
>
>     On each iteration we are growing the size of the extent map tree,
>     making each future search slower, and also visiting the same b+tree
>     leaves over and over again - taking into account with the default leaf
>     size of 16K we can fit more than 200 file extent items in a leaf - so
>     we can visit the same b+tree leaf 200+ times, on each visit walking
>     down a path from the root to the leaf.
>
> So it's easy to see that what we have now doesn't scale well. Also, it
> loads an extent map for every file extent item into memory, which is not
> efficient - we should add extents maps only when doing IO (writing or
> reading file data).
>
> This change implements a new algorithm which scales much better, and
> works like this:
>
> 1) We iterate over the subvolume's b+tree, visiting each leaf that has
>     file extent items once and only once;
>
> 2) For any file extent items found, that don't represent holes or prealloc
>     extents, it will not search the extent map tree - there's no need at
>     all for that - an extent map is just an in-memory representation of a
>     file extent item;
>
> 3) When a hole is found, or a prealloc extent, it will check if there's
>     delalloc for its range. For this it will search for EXTENT_DELALLOC
>     bits in the inode's io tree and check the extent map tree - this is
>     for accounting for unflushed delalloc and for flushed delalloc (the
>     period between running delalloc and ordered extent completion),
>     respectively. This is similar to what the current implementation does
>     when it finds a hole or prealloc extent, but without creating extent
>     maps and adding them to the extent map tree in case they are not
>     loaded in memory;
>
> 4) It never allocates extent maps, or adds extent maps to the inode's
>     extent map tree. This not only saves memory and time (from the tree
>     insertions and allocations), but also eliminates the possibility of
>     -ENOMEM due to allocating too many extent maps.
>
> Part of this new code will also be used later for fiemap (which also
> suffers similar scalability problems).
>
> The following test example can be used to quickly measure the efficiency
> before and after this patch:
>
>      $ cat test-seek-hole.sh
>      #!/bin/bash
>
>      DEV=/dev/sdi
>      MNT=/mnt/sdi
>
>      mkfs.btrfs -f $DEV
>
>      mount -o compress=lzo $DEV $MNT
>
>      # 16G file -> 131073 compressed extents.
>      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
>
>      # Leave a 1M hole at file offset 15G.
>      xfs_io -c "fpunch 15G 1M" $MNT/foobar
>
>      # Unmount and mount again, so that we can test when there's no
>      # metadata cached in memory.
>      umount $MNT
>      mount -o compress=lzo $DEV $MNT
>
>      # Test seeking for hole from offset 0 (hole is at offset 15G).
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
>      echo
>
>      start=$(date +%s%N)
>      xfs_io -c "seek -h 0" $MNT/foobar
>      end=$(date +%s%N)
>      dur=$(( (end - start) / 1000000 ))
>      echo "Took $dur milliseconds to seek first hole (metadata cached)"
>      echo
>
>      umount $MNT
>
> Before this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 176 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 17 milliseconds to seek first hole (metadata cached)
>
> After this change:
>
>      $ ./test-seek-hole.sh
>      (...)
>      Whence	Result
>      HOLE	16106127360
>      Took 43 milliseconds to seek first hole (metadata not cached)
>
>      Whence	Result
>      HOLE	16106127360
>      Took 13 milliseconds to seek first hole (metadata cached)
>
> That's about 4X faster when no metadata is cached and about 30% faster
> when all metadata is cached.
>
> In practice the differences may often be significantly higher, either due
> to a higher number of extents in a file or because the subvolume's b+tree
> is much bigger than in this example, where we only have one file.
>
> Link: https://lwn.net/Articles/718805/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 406 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 96f444ad0951..b292a8ada3a4 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	return ret;
>   }
>
> +/*
> + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> + * has unflushed and/or flushing delalloc. There might be other adjacent
> + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> + * while it gets adjacent subranges, and merging them together.
> + */
> +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	const u64 len = end + 1 - start;
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	u64 em_end;
> +	u64 delalloc_len;
> +
> +	/*
> +	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> +	 * means we have delalloc (dirty pages) for which writeback has not
> +	 * started yet.
> +	 */
> +	*delalloc_start_ret = start;
> +	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> +					len, EXTENT_DELALLOC, 1);
> +	/*
> +	 * If delalloc was found then *delalloc_start_ret has a sector size
> +	 * aligned value (rounded down).
> +	 */
> +	if (delalloc_len > 0)
> +		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> +
> +	/*
> +	 * Now also check if there's any extent map in the range that does not
> +	 * map to a hole or prealloc extent. We do this because:
> +	 *
> +	 * 1) When delalloc is flushed, the file range is locked, we clear the
> +	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> +	 *    an allocated extent. So we might just have been called after
> +	 *    delalloc is flushed and before the ordered extent completes and
> +	 *    inserts the new file extent item in the subvolume's btree;
> +	 *
> +	 * 2) We may have an extent map created by flushing delalloc for a
> +	 *    subrange that starts before the subrange we found marked with
> +	 *    EXTENT_DELALLOC in the io tree.
> +	 */
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	read_unlock(&em_tree->lock);
> +
> +	/* extent_map_end() returns a non-inclusive end offset. */
> +	em_end = em ? extent_map_end(em) : 0;
> +
> +	/*
> +	 * If we have a hole/prealloc extent map, check the next one if this one
> +	 * ends before our range's end.
> +	 */
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> +		struct extent_map *next_em;
> +
> +		read_lock(&em_tree->lock);
> +		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> +		read_unlock(&em_tree->lock);
> +
> +		free_extent_map(em);
> +		em_end = next_em ? extent_map_end(next_em) : 0;
> +		em = next_em;
> +	}
> +
> +	if (em && (em->block_start == EXTENT_MAP_HOLE ||
> +		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> +		free_extent_map(em);
> +		em = NULL;
> +	}
> +
> +	/*
> +	 * No extent map or one for a hole or prealloc extent. Use the delalloc
> +	 * range we found in the io tree if we have one.
> +	 */
> +	if (!em)
> +		return (delalloc_len > 0);
> +
> +	/*
> +	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> +	 * extent map is the only subrange representing delalloc.
> +	 */
> +	if (delalloc_len == 0) {
> +		*delalloc_start_ret = em->start;
> +		*delalloc_end_ret = min(end, em_end - 1);
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map represents a delalloc range that starts before the
> +	 * delalloc range we found in the io tree.
> +	 */
> +	if (em->start < *delalloc_start_ret) {
> +		*delalloc_start_ret = em->start;
> +		/*
> +		 * If the ranges are adjacent, return a combined range.
> +		 * Otherwise return the extent map's range.
> +		 */
> +		if (em_end < *delalloc_start_ret)
> +			*delalloc_end_ret = min(end, em_end - 1);
> +
> +		free_extent_map(em);
> +		return true;
> +	}
> +
> +	/*
> +	 * The extent map starts after the delalloc range we found in the io
> +	 * tree. If it's adjacent, return a combined range, otherwise return
> +	 * the range found in the io tree.
> +	 */
> +	if (*delalloc_end_ret + 1 == em->start)
> +		*delalloc_end_ret = min(end, em_end - 1);
> +
> +	free_extent_map(em);
> +	return true;
> +}
> +
> +/*
> + * Check if there's delalloc in a given range.
> + *
> + * @inode:               The inode.
> + * @start:               The start offset of the range. It does not need to be
> + *                       sector size aligned.
> + * @end:                 The end offset (inclusive value) of the search range.
> + *                       It does not need to be sector size aligned.
> + * @delalloc_start_ret:  Output argument, set to the start offset of the
> + *                       subrange found with delalloc (may not be sector size
> + *                       aligned).
> + * @delalloc_end_ret:    Output argument, set to he end offset (inclusive value)
> + *                       of the subrange found with delalloc.
> + *
> + * Returns true if a subrange with delalloc is found within the given range, and
> + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> + * end offsets of the subrange.
> + */
> +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> +				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> +{
> +	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> +	u64 prev_delalloc_end = 0;
> +	bool ret = false;
> +
> +	while (cur_offset < end) {
> +		u64 delalloc_start;
> +		u64 delalloc_end;
> +		bool delalloc;
> +
> +		delalloc = find_delalloc_subrange(inode, cur_offset, end,
> +						  &delalloc_start,
> +						  &delalloc_end);
> +		if (!delalloc)
> +			break;
> +
> +		if (prev_delalloc_end == 0) {
> +			/* First subrange found. */
> +			*delalloc_start_ret = max(delalloc_start, start);
> +			*delalloc_end_ret = delalloc_end;
> +			ret = true;
> +		} else if (delalloc_start == prev_delalloc_end + 1) {
> +			/* Subrange adjacent to the previous one, merge them. */
> +			*delalloc_end_ret = delalloc_end;
> +		} else {
> +			/* Subrange not adjacent to the previous one, exit. */
> +			break;
> +		}
> +
> +		prev_delalloc_end = delalloc_end;
> +		cur_offset = delalloc_end + 1;
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Check if there's a hole or delalloc range in a range representing a hole (or
> + * prealloc extent) found in the inode's subvolume btree.
> + *
> + * @inode:      The inode.
> + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> + * @start:      Start offset of the hole region. It does not need to be sector
> + *              size aligned.
> + * @end:        End offset (inclusive value) of the hole region. It does not
> + *              need to be sector size aligned.
> + * @start_ret:  Return parameter, used to set the start of the subrange in the
> + *              hole that matches the search criteria (seek mode), if such
> + *              subrange is found (return value of the function is true).
> + *              The value returned here may not be sector size aligned.
> + *
> + * Returns true if a subrange matching the given seek mode is found, and if one
> + * is found, it updates @start_ret with the start of the subrange.
> + */
> +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> +					u64 start, u64 end, u64 *start_ret)
> +{
> +	u64 delalloc_start;
> +	u64 delalloc_end;
> +	bool delalloc;
> +
> +	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> +					  &delalloc_end);
> +	if (delalloc && whence == SEEK_DATA) {
> +		*start_ret = delalloc_start;
> +		return true;
> +	}
> +
> +	if (delalloc && whence == SEEK_HOLE) {
> +		/*
> +		 * We found delalloc but it starts after out start offset. So we
> +		 * have a hole between our start offset and the delalloc start.
> +		 */
> +		if (start < delalloc_start) {
> +			*start_ret = start;
> +			return true;
> +		}
> +		/*
> +		 * Delalloc range starts at our start offset.
> +		 * If the delalloc range's length is smaller than our range,
> +		 * then it means we have a hole that starts where the delalloc
> +		 * subrange ends.
> +		 */
> +		if (delalloc_end < end) {
> +			*start_ret = delalloc_end + 1;
> +			return true;
> +		}
> +
> +		/* There's delalloc for the whole range. */
> +		return false;
> +	}
> +
> +	if (!delalloc && whence == SEEK_HOLE) {
> +		*start_ret = start;
> +		return true;
> +	}
> +
> +	/*
> +	 * No delalloc in the range and we are seeking for data. The caller has
> +	 * to iterate to the next extent item in the subvolume btree.
> +	 */
> +	return false;
> +}
> +
>   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   				  int whence)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map *em = NULL;
>   	struct extent_state *cached_state = NULL;
> -	loff_t i_size = inode->vfs_inode.i_size;
> +	const loff_t i_size = i_size_read(&inode->vfs_inode);
> +	const u64 ino = btrfs_ino(inode);
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	u64 last_extent_end;
>   	u64 lockstart;
>   	u64 lockend;
>   	u64 start;
> -	u64 len;
> -	int ret = 0;
> +	int ret;
> +	bool found = false;
>
>   	if (i_size == 0 || offset >= i_size)
>   		return -ENXIO;
>
> +	/*
> +	 * Quick path. If the inode has no prealloc extents and its number of
> +	 * bytes used matches its i_size, then it can not have holes.
> +	 */
> +	if (whence == SEEK_HOLE &&
> +	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
> +	    inode_get_bytes(&inode->vfs_inode) == i_size)
> +		return i_size;
> +
>   	/*
>   	 * offset can be negative, in this case we start finding DATA/HOLE from
>   	 * the very start of the file.
> @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
>   	if (lockend <= lockstart)
>   		lockend = lockstart + fs_info->sectorsize;
>   	lockend--;
> -	len = lockend - lockstart + 1;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +	path->reada = READA_FORWARD;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = start;
> +
> +	last_extent_end = lockstart;
>
>   	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
>
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0) {
> +		goto out;
> +	} else if (ret > 0 && path->slots[0] > 0) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> +		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> +			path->slots[0]--;
> +	}
> +
>   	while (start < i_size) {
> -		em = btrfs_get_extent_fiemap(inode, start, len);
> -		if (IS_ERR(em)) {
> -			ret = PTR_ERR(em);
> -			em = NULL;
> -			break;
> +		struct extent_buffer *leaf = path->nodes[0];
> +		struct btrfs_file_extent_item *extent;
> +		u64 extent_end;
> +
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0)
> +				goto out;
> +			else if (ret > 0)
> +				break;
> +
> +			leaf = path->nodes[0];
>   		}
>
> -		if (whence == SEEK_HOLE &&
> -		    (em->block_start == EXTENT_MAP_HOLE ||
> -		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> -			break;
> -		else if (whence == SEEK_DATA &&
> -			   (em->block_start != EXTENT_MAP_HOLE &&
> -			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
>   			break;
>
> -		start = em->start + em->len;
> -		free_extent_map(em);
> -		em = NULL;
> +		extent_end = btrfs_file_extent_end(path);
> +
> +		/*
> +		 * In the first iteration we may have a slot that points to an
> +		 * extent that ends before our start offset, so skip it.
> +		 */
> +		if (extent_end <= start) {
> +			path->slots[0]++;
> +			continue;
> +		}
> +
> +		/* We have an implicit hole, NO_HOLES feature is likely set. */
> +		if (last_extent_end < key.offset) {
> +			u64 search_start = last_extent_end;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    key.offset - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * implicit hole range, so need to analyze the extent.
> +			 */
> +		}
> +
> +		extent = btrfs_item_ptr(leaf, path->slots[0],
> +					struct btrfs_file_extent_item);
> +
> +		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> +		    btrfs_file_extent_type(leaf, extent) ==
> +		    BTRFS_FILE_EXTENT_PREALLOC) {
> +			/*
> +			 * Explicit hole or prealloc extent, search for delalloc.
> +			 * A prealloc extent is treated like a hole.
> +			 */
> +			u64 search_start = key.offset;
> +			u64 found_start;
> +
> +			/*
> +			 * First iteration, @start matches @offset and it's
> +			 * within the hole.
> +			 */
> +			if (start == offset)
> +				search_start = offset;
> +
> +			found = find_desired_extent_in_hole(inode, whence,
> +							    search_start,
> +							    extent_end - 1,
> +							    &found_start);
> +			if (found) {
> +				start = found_start;
> +				break;
> +			}
> +			/*
> +			 * Didn't find data or a hole (due to delalloc) in the
> +			 * hole or prealloc range, so need to analyze the next
> +			 * extent item.
> +			 */
> +		} else {
> +			/*
> +			 * Found a regular or inline extent.
> +			 * If we are seeking for data, adjust the start offset
> +			 * and stop, we're done.
> +			 */
> +			if (whence == SEEK_DATA) {
> +				start = max_t(u64, key.offset, offset);
> +				found = true;
> +				break;
> +			}
> +			/*
> +			 * Else, we are seeking for a hole, check the next file
> +			 * extent item.
> +			 */
> +		}
> +
> +		start = extent_end;
> +		last_extent_end = extent_end;
> +		path->slots[0]++;
>   		if (fatal_signal_pending(current)) {
>   			ret = -EINTR;
> -			break;
> +			goto out;
>   		}
>   		cond_resched();
>   	}
> -	free_extent_map(em);
> +
> +	/* We have an implicit hole from the last extent found up to i_size. */
> +	if (!found && start < i_size) {
> +		found = find_desired_extent_in_hole(inode, whence, start,
> +						    i_size - 1, &start);
> +		if (!found)
> +			start = i_size;
> +	}
> +
> +out:
>   	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
>   			     &cached_state);
> -	if (ret) {
> -		offset = ret;
> -	} else {
> -		if (whence == SEEK_DATA && start >= i_size)
> -			offset = -ENXIO;
> -		else
> -			offset = min_t(loff_t, start, i_size);
> -	}
> +	btrfs_free_path(path);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	if (whence == SEEK_DATA && start >= i_size)
> +		return -ENXIO;
>
> -	return offset;
> +	return min_t(loff_t, start, i_size);
>   }
>
>   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
Filipe Manana Sept. 12, 2022, 8:38 a.m. UTC | #7
On Sun, Sep 11, 2022 at 11:12 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/1 21:18, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The current implementation of hole and data seeking for llseek does not
> > scale well in regards to the number of extents and the distance between
> > the start offset and the next hole or extent. This is due to a very high
> > algorithmic complexity. Often we also get reports of btrfs' hole and data
> > seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
> > tag at the bottom).
> >
> > In order to better understand it, lets consider the case where the start
> > offset is 0, we are seeking for a hole and the file size is 16G. Between
> > file offset 0 and the first hole in the file there are 100K extents - this
> > is common for large files, specially if we have compression enabled, since
> > the maximum extent size is limited to 128K. The steps take by the main
> > loop of the current algorithm are the following:
> >
> > 1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
> >     calls btrfs_get_extent(). This will first lookup for an extent map in
> >     the inode's extent map tree (a red black tree). If the extent map is
> >     not loaded in memory, then it will do a lookup for the corresponding
> >     file extent item in the subvolume's b+tree, create an extent map based
> >     on the contents of the file extent item and then add the extent map to
> >     the extent map tree of the inode;
> >
> > 2) The second iteration calls btrfs_get_extent_fiemap() again, this time
> >     with a start offset matching the end offset of the previous extent.
> >     Again, btrfs_get_extent() will first search the extent map tree, and
> >     if it doesn't find an extent map there, it will again search in the
> >     b+tree of the subvolume for a matching file extent item, build an
> >     extent map based on the file extent item, and add the extent map to
> >     to the extent map tree of the inode;
>
> One small question, unrelated to the whole series, since the series
> mostly no longer uses the extent map tree anyway.
>
> I'm wondering if we should add all nearby extent maps in a batch,
> rather than one by one, in btrfs_get_extent()?
> (E.g. if we find what we need, and there are more file extent items in
> the same leaf, we can just continue adding them all to the cache.)
>
> However I'm not 100% sure it's a good idea, as extent maps take up
> memory and there is no way to free them except through truncation or
> inode eviction.
> And we have very limited usage for extent maps (other than reads and
> the log tree?)

This seems like basically the same thing you asked before at:

https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/T/#md0595933858b548804fb329b4ae216ae6d76fa9a

My answer is the same as the one I gave you back then.

I don't see anywhere in btrfs that would benefit from bulk loading
extent maps.
If something needs to load and use a large number of extent maps in a
short period, then it's better off checking and using file extent
items directly, and not creating and using extent maps.
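
Roughly, what I mean by using file extent items directly is a skeleton
like the one below (an untested sketch just to illustrate the idea, in
fs/btrfs context; the walk_file_extents() name is made up, but the
helpers are the same ones this patch already uses):

static int walk_file_extents(struct btrfs_inode *inode)
{
	struct btrfs_root *root = inode->root;
	const u64 ino = btrfs_ino(inode);
	struct btrfs_path *path;
	struct btrfs_key key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* Start at the first file extent item of the inode. */
	key.objectid = ino;
	key.type = BTRFS_EXTENT_DATA_KEY;
	key.offset = 0;

	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
	if (ret < 0)
		goto out;

	while (1) {
		struct extent_buffer *leaf = path->nodes[0];

		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(root, path);
			if (ret < 0)
				goto out;
			if (ret > 0)
				break;
			leaf = path->nodes[0];
		}

		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
			break;

		/*
		 * Inspect the file extent item here (its type, disk bytenr,
		 * end offset from btrfs_file_extent_end(path), etc.), without
		 * ever creating an extent map for it.
		 */

		path->slots[0]++;
	}
	ret = 0;
out:
	btrfs_free_path(path);
	return ret;
}

That visits each leaf with file extent items only once and allocates
nothing besides the path, which is also what this patch does for lseek.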

If you think there's some place that could benefit from it, then try
it and measure it.
As this is all unrelated to lseek, fiemap and this patchset, I'd
suggest moving this discussion elsewhere.

Thanks.

>
> Thanks,
> Qu
>
> >
> > 3) This repeats over and over until we find the first hole (when seeking
> >     for holes) or until we find the first extent (when seeking for data).
> >
> >     If there no extent maps loaded in memory for each iteration, then on
> >     each iteration we do 1 extent map tree search, 1 b+tree search, plus
> >     1 more extent map tree traversal to insert an extent map - plus we
> >     allocate memory for the extent map.
> >
> >     On each iteration we are growing the size of the extent map tree,
> >     making each future search slower, and also visiting the same b+tree
> >     leaves over and over again - taking into account with the default leaf
> >     size of 16K we can fit more than 200 file extent items in a leaf - so
> >     we can visit the same b+tree leaf 200+ times, on each visit walking
> >     down a path from the root to the leaf.
> >
> > So it's easy to see that what we have now doesn't scale well. Also, it
> > loads an extent map for every file extent item into memory, which is not
> > efficient - we should add extents maps only when doing IO (writing or
> > reading file data).
> >
> > This change implements a new algorithm which scales much better, and
> > works like this:
> >
> > 1) We iterate over the subvolume's b+tree, visiting each leaf that has
> >     file extent items once and only once;
> >
> > 2) For any file extent items found, that don't represent holes or prealloc
> >     extents, it will not search the extent map tree - there's no need at
> >     all for that - an extent map is just an in-memory representation of a
> >     file extent item;
> >
> > 3) When a hole is found, or a prealloc extent, it will check if there's
> >     delalloc for its range. For this it will search for EXTENT_DELALLOC
> >     bits in the inode's io tree and check the extent map tree - this is
> >     for accounting for unflushed delalloc and for flushed delalloc (the
> >     period between running delalloc and ordered extent completion),
> >     respectively. This is similar to what the current implementation does
> >     when it finds a hole or prealloc extent, but without creating extent
> >     maps and adding them to the extent map tree in case they are not
> >     loaded in memory;
> >
> > 4) It never allocates extent maps, or adds extent maps to the inode's
> >     extent map tree. This not only saves memory and time (from the tree
> >     insertions and allocations), but also eliminates the possibility of
> >     -ENOMEM due to allocating too many extent maps.
> >
> > Part of this new code will also be used later for fiemap (which also
> > suffers similar scalability problems).
> >
> > The following test example can be used to quickly measure the efficiency
> > before and after this patch:
> >
> >      $ cat test-seek-hole.sh
> >      #!/bin/bash
> >
> >      DEV=/dev/sdi
> >      MNT=/mnt/sdi
> >
> >      mkfs.btrfs -f $DEV
> >
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # 16G file -> 131073 compressed extents.
> >      xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
> >
> >      # Leave a 1M hole at file offset 15G.
> >      xfs_io -c "fpunch 15G 1M" $MNT/foobar
> >
> >      # Unmount and mount again, so that we can test when there's no
> >      # metadata cached in memory.
> >      umount $MNT
> >      mount -o compress=lzo $DEV $MNT
> >
> >      # Test seeking for hole from offset 0 (hole is at offset 15G).
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata not cached)"
> >      echo
> >
> >      start=$(date +%s%N)
> >      xfs_io -c "seek -h 0" $MNT/foobar
> >      end=$(date +%s%N)
> >      dur=$(( (end - start) / 1000000 ))
> >      echo "Took $dur milliseconds to seek first hole (metadata cached)"
> >      echo
> >
> >      umount $MNT
> >
> > Before this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 176 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 17 milliseconds to seek first hole (metadata cached)
> >
> > After this change:
> >
> >      $ ./test-seek-hole.sh
> >      (...)
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 43 milliseconds to seek first hole (metadata not cached)
> >
> >      Whence   Result
> >      HOLE     16106127360
> >      Took 13 milliseconds to seek first hole (metadata cached)
> >
> > That's about 4X faster when no metadata is cached and about 30% faster
> > when all metadata is cached.
> >
> > In practice the differences may often be significantly higher, either due
> > to a higher number of extents in a file or because the subvolume's b+tree
> > is much bigger than in this example, where we only have one file.
> >
> > Link: https://lwn.net/Articles/718805/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/file.c | 437 ++++++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 406 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 96f444ad0951..b292a8ada3a4 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -3601,22 +3601,281 @@ static long btrfs_fallocate(struct file *file, int mode,
> >       return ret;
> >   }
> >
> > +/*
> > + * Helper for have_delalloc_in_range(). Find a subrange in a given range that
> > + * has unflushed and/or flushing delalloc. There might be other adjacent
> > + * subranges after the one it found, so have_delalloc_in_range() keeps looping
> > + * while it gets adjacent subranges, merging them together.
> > + */
> > +static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     const u64 len = end + 1 - start;
> > +     struct extent_map_tree *em_tree = &inode->extent_tree;
> > +     struct extent_map *em;
> > +     u64 em_end;
> > +     u64 delalloc_len;
> > +
> > +     /*
> > +      * Search the io tree first for EXTENT_DELALLOC. If we find any, it
> > +      * means we have delalloc (dirty pages) for which writeback has not
> > +      * started yet.
> > +      */
> > +     *delalloc_start_ret = start;
> > +     delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
> > +                                     len, EXTENT_DELALLOC, 1);
> > +     /*
> > +      * If delalloc was found then *delalloc_start_ret has a sector size
> > +      * aligned value (rounded down).
> > +      */
> > +     if (delalloc_len > 0)
> > +             *delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
> > +
> > +     /*
> > +      * Now also check if there's any extent map in the range that does not
> > +      * map to a hole or prealloc extent. We do this because:
> > +      *
> > +      * 1) When delalloc is flushed, the file range is locked, we clear the
> > +      *    EXTENT_DELALLOC bit from the io tree and create an extent map for
> > +      *    an allocated extent. So we might just have been called after
> > +      *    delalloc is flushed and before the ordered extent completes and
> > +      *    inserts the new file extent item in the subvolume's btree;
> > +      *
> > +      * 2) We may have an extent map created by flushing delalloc for a
> > +      *    subrange that starts before the subrange we found marked with
> > +      *    EXTENT_DELALLOC in the io tree.
> > +      */
> > +     read_lock(&em_tree->lock);
> > +     em = lookup_extent_mapping(em_tree, start, len);
> > +     read_unlock(&em_tree->lock);
> > +
> > +     /* extent_map_end() returns a non-inclusive end offset. */
> > +     em_end = em ? extent_map_end(em) : 0;
> > +
> > +     /*
> > +      * If we have a hole/prealloc extent map, check the next one if this one
> > +      * ends before our range's end.
> > +      */
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
> > +             struct extent_map *next_em;
> > +
> > +             read_lock(&em_tree->lock);
> > +             next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
> > +             read_unlock(&em_tree->lock);
> > +
> > +             free_extent_map(em);
> > +             em_end = next_em ? extent_map_end(next_em) : 0;
> > +             em = next_em;
> > +     }
> > +
> > +     if (em && (em->block_start == EXTENT_MAP_HOLE ||
> > +                test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
> > +             free_extent_map(em);
> > +             em = NULL;
> > +     }
> > +
> > +     /*
> > +      * No extent map or one for a hole or prealloc extent. Use the delalloc
> > +      * range we found in the io tree if we have one.
> > +      */
> > +     if (!em)
> > +             return (delalloc_len > 0);
> > +
> > +     /*
> > +      * We don't have any range as EXTENT_DELALLOC in the io tree, so the
> > +      * extent map is the only subrange representing delalloc.
> > +      */
> > +     if (delalloc_len == 0) {
> > +             *delalloc_start_ret = em->start;
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map represents a delalloc range that starts before the
> > +      * delalloc range we found in the io tree.
> > +      */
> > +     if (em->start < *delalloc_start_ret) {
> > +             *delalloc_start_ret = em->start;
> > +             /*
> > +              * If the ranges are adjacent, return a combined range.
> > +              * Otherwise return the extent map's range.
> > +              */
> > +             if (em_end < *delalloc_start_ret)
> > +                     *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +             free_extent_map(em);
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * The extent map starts after the delalloc range we found in the io
> > +      * tree. If it's adjacent, return a combined range, otherwise return
> > +      * the range found in the io tree.
> > +      */
> > +     if (*delalloc_end_ret + 1 == em->start)
> > +             *delalloc_end_ret = min(end, em_end - 1);
> > +
> > +     free_extent_map(em);
> > +     return true;
> > +}
> > +
> > +/*
> > + * Check if there's delalloc in a given range.
> > + *
> > + * @inode:               The inode.
> > + * @start:               The start offset of the range. It does not need to be
> > + *                       sector size aligned.
> > + * @end:                 The end offset (inclusive value) of the search range.
> > + *                       It does not need to be sector size aligned.
> > + * @delalloc_start_ret:  Output argument, set to the start offset of the
> > + *                       subrange found with delalloc (may not be sector size
> > + *                       aligned).
> > + * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
> > + *                       of the subrange found with delalloc.
> > + *
> > + * Returns true if a subrange with delalloc is found within the given range, and
> > + * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
> > + * end offsets of the subrange.
> > + */
> > +static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
> > +                                u64 *delalloc_start_ret, u64 *delalloc_end_ret)
> > +{
> > +     u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
> > +     u64 prev_delalloc_end = 0;
> > +     bool ret = false;
> > +
> > +     while (cur_offset < end) {
> > +             u64 delalloc_start;
> > +             u64 delalloc_end;
> > +             bool delalloc;
> > +
> > +             delalloc = find_delalloc_subrange(inode, cur_offset, end,
> > +                                               &delalloc_start,
> > +                                               &delalloc_end);
> > +             if (!delalloc)
> > +                     break;
> > +
> > +             if (prev_delalloc_end == 0) {
> > +                     /* First subrange found. */
> > +                     *delalloc_start_ret = max(delalloc_start, start);
> > +                     *delalloc_end_ret = delalloc_end;
> > +                     ret = true;
> > +             } else if (delalloc_start == prev_delalloc_end + 1) {
> > +                     /* Subrange adjacent to the previous one, merge them. */
> > +                     *delalloc_end_ret = delalloc_end;
> > +             } else {
> > +                     /* Subrange not adjacent to the previous one, exit. */
> > +                     break;
> > +             }
> > +
> > +             prev_delalloc_end = delalloc_end;
> > +             cur_offset = delalloc_end + 1;
> > +             cond_resched();
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Check if there's a hole or delalloc range in a range representing a hole (or
> > + * prealloc extent) found in the inode's subvolume btree.
> > + *
> > + * @inode:      The inode.
> > + * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
> > + * @start:      Start offset of the hole region. It does not need to be sector
> > + *              size aligned.
> > + * @end:        End offset (inclusive value) of the hole region. It does not
> > + *              need to be sector size aligned.
> > + * @start_ret:  Return parameter, used to set the start of the subrange in the
> > + *              hole that matches the search criteria (seek mode), if such
> > + *              subrange is found (return value of the function is true).
> > + *              The value returned here may not be sector size aligned.
> > + *
> > + * Returns true if a subrange matching the given seek mode is found, and if one
> > + * is found, it updates @start_ret with the start of the subrange.
> > + */
> > +static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
> > +                                     u64 start, u64 end, u64 *start_ret)
> > +{
> > +     u64 delalloc_start;
> > +     u64 delalloc_end;
> > +     bool delalloc;
> > +
> > +     delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
> > +                                       &delalloc_end);
> > +     if (delalloc && whence == SEEK_DATA) {
> > +             *start_ret = delalloc_start;
> > +             return true;
> > +     }
> > +
> > +     if (delalloc && whence == SEEK_HOLE) {
> > +             /*
> > +              * We found delalloc but it starts after our start offset. So we
> > +              * have a hole between our start offset and the delalloc start.
> > +              */
> > +             if (start < delalloc_start) {
> > +                     *start_ret = start;
> > +                     return true;
> > +             }
> > +             /*
> > +              * Delalloc range starts at our start offset.
> > +              * If the delalloc range's length is smaller than our range,
> > +              * then it means we have a hole that starts where the delalloc
> > +              * subrange ends.
> > +              */
> > +             if (delalloc_end < end) {
> > +                     *start_ret = delalloc_end + 1;
> > +                     return true;
> > +             }
> > +
> > +             /* There's delalloc for the whole range. */
> > +             return false;
> > +     }
> > +
> > +     if (!delalloc && whence == SEEK_HOLE) {
> > +             *start_ret = start;
> > +             return true;
> > +     }
> > +
> > +     /*
> > +      * No delalloc in the range and we are seeking for data. The caller has
> > +      * to iterate to the next extent item in the subvolume btree.
> > +      */
> > +     return false;
> > +}
> > +
> >   static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >                                 int whence)
> >   {
> >       struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > -     struct extent_map *em = NULL;
> >       struct extent_state *cached_state = NULL;
> > -     loff_t i_size = inode->vfs_inode.i_size;
> > +     const loff_t i_size = i_size_read(&inode->vfs_inode);
> > +     const u64 ino = btrfs_ino(inode);
> > +     struct btrfs_root *root = inode->root;
> > +     struct btrfs_path *path;
> > +     struct btrfs_key key;
> > +     u64 last_extent_end;
> >       u64 lockstart;
> >       u64 lockend;
> >       u64 start;
> > -     u64 len;
> > -     int ret = 0;
> > +     int ret;
> > +     bool found = false;
> >
> >       if (i_size == 0 || offset >= i_size)
> >               return -ENXIO;
> >
> > +     /*
> > +      * Quick path. If the inode has no prealloc extents and its number of
> > +      * bytes used matches its i_size, then it can not have holes.
> > +      */
> > +     if (whence == SEEK_HOLE &&
> > +         !(inode->flags & BTRFS_INODE_PREALLOC) &&
> > +         inode_get_bytes(&inode->vfs_inode) == i_size)
> > +             return i_size;
> > +
> >       /*
> >        * offset can be negative, in this case we start finding DATA/HOLE from
> >        * the very start of the file.
> > @@ -3628,49 +3887,165 @@ static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
> >       if (lockend <= lockstart)
> >               lockend = lockstart + fs_info->sectorsize;
> >       lockend--;
> > -     len = lockend - lockstart + 1;
> > +
> > +     path = btrfs_alloc_path();
> > +     if (!path)
> > +             return -ENOMEM;
> > +     path->reada = READA_FORWARD;
> > +
> > +     key.objectid = ino;
> > +     key.type = BTRFS_EXTENT_DATA_KEY;
> > +     key.offset = start;
> > +
> > +     last_extent_end = lockstart;
> >
> >       lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
> >
> > +     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> > +     if (ret < 0) {
> > +             goto out;
> > +     } else if (ret > 0 && path->slots[0] > 0) {
> > +             btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
> > +             if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
> > +                     path->slots[0]--;
> > +     }
> > +
> >       while (start < i_size) {
> > -             em = btrfs_get_extent_fiemap(inode, start, len);
> > -             if (IS_ERR(em)) {
> > -                     ret = PTR_ERR(em);
> > -                     em = NULL;
> > -                     break;
> > +             struct extent_buffer *leaf = path->nodes[0];
> > +             struct btrfs_file_extent_item *extent;
> > +             u64 extent_end;
> > +
> > +             if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> > +                     ret = btrfs_next_leaf(root, path);
> > +                     if (ret < 0)
> > +                             goto out;
> > +                     else if (ret > 0)
> > +                             break;
> > +
> > +                     leaf = path->nodes[0];
> >               }
> >
> > -             if (whence == SEEK_HOLE &&
> > -                 (em->block_start == EXTENT_MAP_HOLE ||
> > -                  test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > -                     break;
> > -             else if (whence == SEEK_DATA &&
> > -                        (em->block_start != EXTENT_MAP_HOLE &&
> > -                         !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
> > +             btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> > +             if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
> >                       break;
> >
> > -             start = em->start + em->len;
> > -             free_extent_map(em);
> > -             em = NULL;
> > +             extent_end = btrfs_file_extent_end(path);
> > +
> > +             /*
> > +              * In the first iteration we may have a slot that points to an
> > +              * extent that ends before our start offset, so skip it.
> > +              */
> > +             if (extent_end <= start) {
> > +                     path->slots[0]++;
> > +                     continue;
> > +             }
> > +
> > +             /* We have an implicit hole, NO_HOLES feature is likely set. */
> > +             if (last_extent_end < key.offset) {
> > +                     u64 search_start = last_extent_end;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         key.offset - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * implicit hole range, so need to analyze the extent.
> > +                      */
> > +             }
> > +
> > +             extent = btrfs_item_ptr(leaf, path->slots[0],
> > +                                     struct btrfs_file_extent_item);
> > +
> > +             if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
> > +                 btrfs_file_extent_type(leaf, extent) ==
> > +                 BTRFS_FILE_EXTENT_PREALLOC) {
> > +                     /*
> > +                      * Explicit hole or prealloc extent, search for delalloc.
> > +                      * A prealloc extent is treated like a hole.
> > +                      */
> > +                     u64 search_start = key.offset;
> > +                     u64 found_start;
> > +
> > +                     /*
> > +                      * First iteration, @start matches @offset and it's
> > +                      * within the hole.
> > +                      */
> > +                     if (start == offset)
> > +                             search_start = offset;
> > +
> > +                     found = find_desired_extent_in_hole(inode, whence,
> > +                                                         search_start,
> > +                                                         extent_end - 1,
> > +                                                         &found_start);
> > +                     if (found) {
> > +                             start = found_start;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Didn't find data or a hole (due to delalloc) in the
> > +                      * hole or prealloc range, so need to analyze the next
> > +                      * extent item.
> > +                      */
> > +             } else {
> > +                     /*
> > +                      * Found a regular or inline extent.
> > +                      * If we are seeking for data, adjust the start offset
> > +                      * and stop, we're done.
> > +                      */
> > +                     if (whence == SEEK_DATA) {
> > +                             start = max_t(u64, key.offset, offset);
> > +                             found = true;
> > +                             break;
> > +                     }
> > +                     /*
> > +                      * Else, we are seeking for a hole, check the next file
> > +                      * extent item.
> > +                      */
> > +             }
> > +
> > +             start = extent_end;
> > +             last_extent_end = extent_end;
> > +             path->slots[0]++;
> >               if (fatal_signal_pending(current)) {
> >                       ret = -EINTR;
> > -                     break;
> > +                     goto out;
> >               }
> >               cond_resched();
> >       }
> > -     free_extent_map(em);
> > +
> > +     /* We have an implicit hole from the last extent found up to i_size. */
> > +     if (!found && start < i_size) {
> > +             found = find_desired_extent_in_hole(inode, whence, start,
> > +                                                 i_size - 1, &start);
> > +             if (!found)
> > +                     start = i_size;
> > +     }
> > +
> > +out:
> >       unlock_extent_cached(&inode->io_tree, lockstart, lockend,
> >                            &cached_state);
> > -     if (ret) {
> > -             offset = ret;
> > -     } else {
> > -             if (whence == SEEK_DATA && start >= i_size)
> > -                     offset = -ENXIO;
> > -             else
> > -                     offset = min_t(loff_t, start, i_size);
> > -     }
> > +     btrfs_free_path(path);
> > +
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (whence == SEEK_DATA && start >= i_size)
> > +             return -ENXIO;
> >
> > -     return offset;
> > +     return min_t(loff_t, start, i_size);
> >   }
> >
> >   static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
diff mbox series

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 96f444ad0951..b292a8ada3a4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3601,22 +3601,281 @@  static long btrfs_fallocate(struct file *file, int mode,
 	return ret;
 }
 
+/*
+ * Helper for have_delalloc_in_range(). Find a subrange in a given range that
+ * has unflushed and/or flushing delalloc. There might be other adjacent
+ * subranges after the one it found, so have_delalloc_in_range() keeps looping
+ * while it gets adjacent subranges, merging them together.
+ */
+static bool find_delalloc_subrange(struct btrfs_inode *inode, u64 start, u64 end,
+				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
+{
+	const u64 len = end + 1 - start;
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct extent_map *em;
+	u64 em_end;
+	u64 delalloc_len;
+
+	/*
+	 * Search the io tree first for EXTENT_DELALLOC. If we find any, it
+	 * means we have delalloc (dirty pages) for which writeback has not
+	 * started yet.
+	 */
+	*delalloc_start_ret = start;
+	delalloc_len = count_range_bits(&inode->io_tree, delalloc_start_ret, end,
+					len, EXTENT_DELALLOC, 1);
+	/*
+	 * If delalloc was found then *delalloc_start_ret has a sector size
+	 * aligned value (rounded down).
+	 */
+	if (delalloc_len > 0)
+		*delalloc_end_ret = *delalloc_start_ret + delalloc_len - 1;
+
+	/*
+	 * Now also check if there's any extent map in the range that does not
+	 * map to a hole or prealloc extent. We do this because:
+	 *
+	 * 1) When delalloc is flushed, the file range is locked, we clear the
+	 *    EXTENT_DELALLOC bit from the io tree and create an extent map for
+	 *    an allocated extent. So we might just have been called after
+	 *    delalloc is flushed and before the ordered extent completes and
+	 *    inserts the new file extent item in the subvolume's btree;
+	 *
+	 * 2) We may have an extent map created by flushing delalloc for a
+	 *    subrange that starts before the subrange we found marked with
+	 *    EXTENT_DELALLOC in the io tree.
+	 */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, start, len);
+	read_unlock(&em_tree->lock);
+
+	/* extent_map_end() returns a non-inclusive end offset. */
+	em_end = em ? extent_map_end(em) : 0;
+
+	/*
+	 * If we have a hole/prealloc extent map, check the next one if this one
+	 * ends before our range's end.
+	 */
+	if (em && (em->block_start == EXTENT_MAP_HOLE ||
+		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) && em_end < end) {
+		struct extent_map *next_em;
+
+		read_lock(&em_tree->lock);
+		next_em = lookup_extent_mapping(em_tree, em_end, len - em_end);
+		read_unlock(&em_tree->lock);
+
+		free_extent_map(em);
+		em_end = next_em ? extent_map_end(next_em) : 0;
+		em = next_em;
+	}
+
+	if (em && (em->block_start == EXTENT_MAP_HOLE ||
+		   test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
+		free_extent_map(em);
+		em = NULL;
+	}
+
+	/*
+	 * No extent map or one for a hole or prealloc extent. Use the delalloc
+	 * range we found in the io tree if we have one.
+	 */
+	if (!em)
+		return (delalloc_len > 0);
+
+	/*
+	 * We don't have any range as EXTENT_DELALLOC in the io tree, so the
+	 * extent map is the only subrange representing delalloc.
+	 */
+	if (delalloc_len == 0) {
+		*delalloc_start_ret = em->start;
+		*delalloc_end_ret = min(end, em_end - 1);
+		free_extent_map(em);
+		return true;
+	}
+
+	/*
+	 * The extent map represents a delalloc range that starts before the
+	 * delalloc range we found in the io tree.
+	 */
+	if (em->start < *delalloc_start_ret) {
+		*delalloc_start_ret = em->start;
+		/*
+		 * If the ranges are adjacent, return a combined range.
+		 * Otherwise return the extent map's range.
+		 */
+		if (em_end < *delalloc_start_ret)
+			*delalloc_end_ret = min(end, em_end - 1);
+
+		free_extent_map(em);
+		return true;
+	}
+
+	/*
+	 * The extent map starts after the delalloc range we found in the io
+	 * tree. If it's adjacent, return a combined range, otherwise return
+	 * the range found in the io tree.
+	 */
+	if (*delalloc_end_ret + 1 == em->start)
+		*delalloc_end_ret = min(end, em_end - 1);
+
+	free_extent_map(em);
+	return true;
+}
+
+/*
+ * Check if there's delalloc in a given range.
+ *
+ * @inode:               The inode.
+ * @start:               The start offset of the range. It does not need to be
+ *                       sector size aligned.
+ * @end:                 The end offset (inclusive value) of the search range.
+ *                       It does not need to be sector size aligned.
+ * @delalloc_start_ret:  Output argument, set to the start offset of the
+ *                       subrange found with delalloc (may not be sector size
+ *                       aligned).
+ * @delalloc_end_ret:    Output argument, set to the end offset (inclusive value)
+ *                       of the subrange found with delalloc.
+ *
+ * Returns true if a subrange with delalloc is found within the given range, and
+ * if so it sets @delalloc_start_ret and @delalloc_end_ret with the start and
+ * end offsets of the subrange.
+ */
+static bool have_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
+				   u64 *delalloc_start_ret, u64 *delalloc_end_ret)
+{
+	u64 cur_offset = round_down(start, inode->root->fs_info->sectorsize);
+	u64 prev_delalloc_end = 0;
+	bool ret = false;
+
+	while (cur_offset < end) {
+		u64 delalloc_start;
+		u64 delalloc_end;
+		bool delalloc;
+
+		delalloc = find_delalloc_subrange(inode, cur_offset, end,
+						  &delalloc_start,
+						  &delalloc_end);
+		if (!delalloc)
+			break;
+
+		if (prev_delalloc_end == 0) {
+			/* First subrange found. */
+			*delalloc_start_ret = max(delalloc_start, start);
+			*delalloc_end_ret = delalloc_end;
+			ret = true;
+		} else if (delalloc_start == prev_delalloc_end + 1) {
+			/* Subrange adjacent to the previous one, merge them. */
+			*delalloc_end_ret = delalloc_end;
+		} else {
+			/* Subrange not adjacent to the previous one, exit. */
+			break;
+		}
+
+		prev_delalloc_end = delalloc_end;
+		cur_offset = delalloc_end + 1;
+		cond_resched();
+	}
+
+	return ret;
+}
+
+/*
+ * Check if there's a hole or delalloc range in a range representing a hole (or
+ * prealloc extent) found in the inode's subvolume btree.
+ *
+ * @inode:      The inode.
+ * @whence:     Seek mode (SEEK_DATA or SEEK_HOLE).
+ * @start:      Start offset of the hole region. It does not need to be sector
+ *              size aligned.
+ * @end:        End offset (inclusive value) of the hole region. It does not
+ *              need to be sector size aligned.
+ * @start_ret:  Return parameter, used to set the start of the subrange in the
+ *              hole that matches the search criteria (seek mode), if such
+ *              subrange is found (return value of the function is true).
+ *              The value returned here may not be sector size aligned.
+ *
+ * Returns true if a subrange matching the given seek mode is found, and if one
+ * is found, it updates @start_ret with the start of the subrange.
+ */
+static bool find_desired_extent_in_hole(struct btrfs_inode *inode, int whence,
+					u64 start, u64 end, u64 *start_ret)
+{
+	u64 delalloc_start;
+	u64 delalloc_end;
+	bool delalloc;
+
+	delalloc = have_delalloc_in_range(inode, start, end, &delalloc_start,
+					  &delalloc_end);
+	if (delalloc && whence == SEEK_DATA) {
+		*start_ret = delalloc_start;
+		return true;
+	}
+
+	if (delalloc && whence == SEEK_HOLE) {
+		/*
+		 * We found delalloc but it starts after our start offset. So we
+		 * have a hole between our start offset and the delalloc start.
+		 */
+		if (start < delalloc_start) {
+			*start_ret = start;
+			return true;
+		}
+		/*
+		 * Delalloc range starts at our start offset.
+		 * If the delalloc range's length is smaller than our range,
+		 * then it means we have a hole that starts where the delalloc
+		 * subrange ends.
+		 */
+		if (delalloc_end < end) {
+			*start_ret = delalloc_end + 1;
+			return true;
+		}
+
+		/* There's delalloc for the whole range. */
+		return false;
+	}
+
+	if (!delalloc && whence == SEEK_HOLE) {
+		*start_ret = start;
+		return true;
+	}
+
+	/*
+	 * No delalloc in the range and we are seeking for data. The caller has
+	 * to iterate to the next extent item in the subvolume btree.
+	 */
+	return false;
+}
+
 static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
 				  int whence)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_map *em = NULL;
 	struct extent_state *cached_state = NULL;
-	loff_t i_size = inode->vfs_inode.i_size;
+	const loff_t i_size = i_size_read(&inode->vfs_inode);
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 last_extent_end;
 	u64 lockstart;
 	u64 lockend;
 	u64 start;
-	u64 len;
-	int ret = 0;
+	int ret;
+	bool found = false;
 
 	if (i_size == 0 || offset >= i_size)
 		return -ENXIO;
 
+	/*
+	 * Quick path. If the inode has no prealloc extents and its number of
+	 * bytes used matches its i_size, then it can not have holes.
+	 */
+	if (whence == SEEK_HOLE &&
+	    !(inode->flags & BTRFS_INODE_PREALLOC) &&
+	    inode_get_bytes(&inode->vfs_inode) == i_size)
+		return i_size;
+
 	/*
 	 * offset can be negative, in this case we start finding DATA/HOLE from
 	 * the very start of the file.
@@ -3628,49 +3887,165 @@  static loff_t find_desired_extent(struct btrfs_inode *inode, loff_t offset,
 	if (lockend <= lockstart)
 		lockend = lockstart + fs_info->sectorsize;
 	lockend--;
-	len = lockend - lockstart + 1;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	path->reada = READA_FORWARD;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = start;
+
+	last_extent_end = lockstart;
 
 	lock_extent_bits(&inode->io_tree, lockstart, lockend, &cached_state);
 
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0) {
+		goto out;
+	} else if (ret > 0 && path->slots[0] > 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path->slots[0]--;
+	}
+
 	while (start < i_size) {
-		em = btrfs_get_extent_fiemap(inode, start, len);
-		if (IS_ERR(em)) {
-			ret = PTR_ERR(em);
-			em = NULL;
-			break;
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_file_extent_item *extent;
+		u64 extent_end;
+
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto out;
+			else if (ret > 0)
+				break;
+
+			leaf = path->nodes[0];
 		}
 
-		if (whence == SEEK_HOLE &&
-		    (em->block_start == EXTENT_MAP_HOLE ||
-		     test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
-			break;
-		else if (whence == SEEK_DATA &&
-			   (em->block_start != EXTENT_MAP_HOLE &&
-			    !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)))
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
 			break;
 
-		start = em->start + em->len;
-		free_extent_map(em);
-		em = NULL;
+		extent_end = btrfs_file_extent_end(path);
+
+		/*
+		 * In the first iteration we may have a slot that points to an
+		 * extent that ends before our start offset, so skip it.
+		 */
+		if (extent_end <= start) {
+			path->slots[0]++;
+			continue;
+		}
+
+		/* We have an implicit hole, NO_HOLES feature is likely set. */
+		if (last_extent_end < key.offset) {
+			u64 search_start = last_extent_end;
+			u64 found_start;
+
+			/*
+			 * First iteration, @start matches @offset and it's
+			 * within the hole.
+			 */
+			if (start == offset)
+				search_start = offset;
+
+			found = find_desired_extent_in_hole(inode, whence,
+							    search_start,
+							    key.offset - 1,
+							    &found_start);
+			if (found) {
+				start = found_start;
+				break;
+			}
+			/*
+			 * Didn't find data or a hole (due to delalloc) in the
+			 * implicit hole range, so need to analyze the extent.
+			 */
+		}
+
+		extent = btrfs_item_ptr(leaf, path->slots[0],
+					struct btrfs_file_extent_item);
+
+		if (btrfs_file_extent_disk_bytenr(leaf, extent) == 0 ||
+		    btrfs_file_extent_type(leaf, extent) ==
+		    BTRFS_FILE_EXTENT_PREALLOC) {
+			/*
+			 * Explicit hole or prealloc extent, search for delalloc.
+			 * A prealloc extent is treated like a hole.
+			 */
+			u64 search_start = key.offset;
+			u64 found_start;
+
+			/*
+			 * First iteration, @start matches @offset and it's
+			 * within the hole.
+			 */
+			if (start == offset)
+				search_start = offset;
+
+			found = find_desired_extent_in_hole(inode, whence,
+							    search_start,
+							    extent_end - 1,
+							    &found_start);
+			if (found) {
+				start = found_start;
+				break;
+			}
+			/*
+			 * Didn't find data or a hole (due to delalloc) in the
+			 * hole or prealloc range, so need to analyze the next
+			 * extent item.
+			 */
+		} else {
+			/*
+			 * Found a regular or inline extent.
+			 * If we are seeking for data, adjust the start offset
+			 * and stop, we're done.
+			 */
+			if (whence == SEEK_DATA) {
+				start = max_t(u64, key.offset, offset);
+				found = true;
+				break;
+			}
+			/*
+			 * Else, we are seeking for a hole, check the next file
+			 * extent item.
+			 */
+		}
+
+		start = extent_end;
+		last_extent_end = extent_end;
+		path->slots[0]++;
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			break;
+			goto out;
 		}
 		cond_resched();
 	}
-	free_extent_map(em);
+
+	/* We have an implicit hole from the last extent found up to i_size. */
+	if (!found && start < i_size) {
+		found = find_desired_extent_in_hole(inode, whence, start,
+						    i_size - 1, &start);
+		if (!found)
+			start = i_size;
+	}
+
+out:
 	unlock_extent_cached(&inode->io_tree, lockstart, lockend,
 			     &cached_state);
-	if (ret) {
-		offset = ret;
-	} else {
-		if (whence == SEEK_DATA && start >= i_size)
-			offset = -ENXIO;
-		else
-			offset = min_t(loff_t, start, i_size);
-	}
+	btrfs_free_path(path);
+
+	if (ret < 0)
+		return ret;
+
+	if (whence == SEEK_DATA && start >= i_size)
+		return -ENXIO;
 
-	return offset;
+	return min_t(loff_t, start, i_size);
 }
 
 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)