[v2,0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time

Message ID 20191008044909.157750-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Oct. 8, 2019, 4:49 a.m. UTC
This patchset can be fetched from:
https://github.com/adam900710/linux/tree/bg_tree
which is based on the v5.4-rc1 tag.

This patchset hugely reduces the mount time of a large fs by putting all
block group items into their own tree.

The old behavior reads out all block group items at mount time; however,
because the block group item keys are scattered across tons of extent
items, we must call btrfs_search_slot() for each block group.

This works fine for a small fs, but when the number of block groups goes
beyond 200, each such tree search becomes a random read, causing an
obvious slowdown.

On the other hand, btrfs_read_chunk_tree() is still very fast, since we
put CHUNK_ITEMs into their own tree, packed next to each other.

Following this idea, we can do the same thing for block group items:
instead of triggering btrfs_search_slot() for each block group, we just
call btrfs_next_item(), and in most cases we can finish in memory,
hugely speeding up mount (see BENCHMARK below).

The only disadvantage is that this method introduces an incompat
feature, so an existing fs can't use it directly: either specify it at
mkfs time, or use the btrfs-progs offline convert tool.

[[Benchmark]]
Since I have upgraded my rig to all-NVMe storage, there is no HDD
test result.

Physical device:	NVMe SSD
VM device:		VirtIO block device, backed by a sparse file
Nodesize:		4K  (to bump up tree height)
Extent data size:	4M
Fs size used:		1T

All file extents on disk are 4M in size, preallocated to reduce space
usage (as the VM uses a loopback block device backed by a sparse file).

Without patchset:
Use ftrace function graph:

 7)               |  open_ctree [btrfs]() {
 7)               |    btrfs_read_block_groups [btrfs]() {
 7) @ 805851.8 us |    }
 7) @ 911890.2 us |  }

 btrfs_read_block_groups() takes 88% of the total mount time.

With patchset, and use -O bg-tree mkfs option:

 6)               |  open_ctree [btrfs]() {
 6)               |    btrfs_read_block_groups [btrfs]() {
 6) * 91204.69 us |    }
 6) @ 192039.5 us |  }

  open_ctree() now takes only 21% of the original mount time, and
  btrfs_read_block_groups() only 47% of the total open_ctree()
  execution time.

The reason is pretty obvious when considering how many tree blocks need
to be read from disk:
- Original extent tree:
  nodes:	55
  leaves:	1025
  total:	1080
- Block group tree:
  nodes:	1
  leaves:	13
  total:	14
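The leaf count for the dedicated tree can be roughly reproduced with
back-of-envelope arithmetic (a sketch only; the on-disk sizes below --
101-byte leaf header, 25-byte per-item header, 24-byte block group
item -- are my assumptions about the format, not from this patchset):

```sh
# Estimate leaves needed to pack N block group items at a given nodesize.
# Assumed sizes: leaf header 101 bytes, per-item header (struct
# btrfs_item) 25 bytes, block group item 24 bytes.
nodesize=4096
nr_bgs=1024                                # ~1T of data in ~1G block groups
usable=$((nodesize - 101))
per_leaf=$((usable / (25 + 24)))           # ~81 items per 4K leaf
leaves=$(( (nr_bgs + per_leaf - 1) / per_leaf ))
echo "items/leaf: $per_leaf, leaves: $leaves"   # leaves: 13, matching above
```

The same arithmetic hints at why the extent tree needs ~1000 leaves for
the same job: there each block group item shares its leaf with thousands
of extent items instead of other block group items.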

Not to mention that tree block readahead works pretty well for the bg
tree, as we will read every item, while readahead for the extent tree is
just a disaster, as the block groups are scattered across the whole
extent tree.

Changelog:
v2:
- Rebase to v5.4-rc1
  Minor conflicts due to code moved to block-group.c
- Fix a bug where some block groups would not be loaded at mount time
  It was a bug in the refactor patch, not exposed by the previous round
  of tests.
- Add a new patch to remove a dead check
- Update benchmark to NVMe-based results
  A hardware upgrade is not always a good thing for benchmarks.

Qu Wenruo (3):
  btrfs: block-group: Refactor btrfs_read_block_groups()
  btrfs: disk-io: Remove unnecessary check before freeing chunk root
  btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time

 fs/btrfs/block-group.c          | 306 ++++++++++++++++++++------------
 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  16 +-
 fs/btrfs/sysfs.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 6 files changed, 213 insertions(+), 120 deletions(-)

Comments

Johannes Thumshirn Oct. 8, 2019, 9:14 a.m. UTC | #1
> [[Benchmark]]
> Since I have upgraded my rig to all NVME storage, there is no HDD
> test result.
> 
> Physical device:	NVMe SSD
> VM device:		VirtIO block device, backup by sparse file
> Nodesize:		4K  (to bump up tree height)
> Extent data size:	4M
> Fs size used:		1T
> 
> All file extents on disk is in 4M size, preallocated to reduce space usage
> (as the VM uses loopback block device backed by sparse file)

Do you have some additional details about the test setup? I tried to
do the same testing for a bug Felix (added to Cc) reported to me at
the ALPSS conference, and I couldn't reproduce the issue.

My testing was a 100TB sparse file passed into a VM and running this
script to touch all block groups:

#!/bin/sh

FILE=/mnt/test

add_dirty_bg() {
        off="$1"
        len="$2"
        touch $FILE
        xfs_io -c "falloc $off $len" $FILE
        rm $FILE
}

mkfs.btrfs /dev/vda
mount /dev/vda /mnt

for ((i = 1; i < 100000; i++)); do
        add_dirty_bg $i"G" "1G"
done

umount /mnt
Qu Wenruo Oct. 8, 2019, 9:26 a.m. UTC | #2
On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>> [[Benchmark]]
>> Since I have upgraded my rig to all NVME storage, there is no HDD
>> test result.
>>
>> Physical device:	NVMe SSD
>> VM device:		VirtIO block device, backup by sparse file
>> Nodesize:		4K  (to bump up tree height)
>> Extent data size:	4M
>> Fs size used:		1T
>>
>> All file extents on disk is in 4M size, preallocated to reduce space usage
>> (as the VM uses loopback block device backed by sparse file)
> 
> Do you have a some additional details about the test setup? I tried to
> do the same (testing) for a bug Felix (added to Cc) reported to my at
> the ALPSS Conference and I couldn't reproduce the issue.
> 
> My testing was a 100TB sparse file passed into a VM and running this
> script to touch all blockgroups:

Here is my test scripts:
---
#!/bin/bash

dev="/dev/vdb"
mnt="/mnt/btrfs"

nr_subv=16
nr_extents=16384
extent_size=$((4 * 1024 * 1024)) # 4M

_fail()
{
        echo "!!! FAILED: $@ !!!"
        exit 1
}

fill_one_subv()
{
        path=$1
        if [ -z $path ]; then
                _fail "wrong parameter for fill_one_subv"
        fi
        btrfs subv create $path || _fail "create subv"

        for i in $(seq 0 $((nr_extents - 1))); do
                fallocate -o $((i * $extent_size)) -l $extent_size $path/file || _fail "fallocate"
        done
}

declare -a pids
umount $mnt &> /dev/null
umount $dev &> /dev/null

#~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
mkfs.btrfs -f -n 4k $dev
mount $dev $mnt -o nospace_cache

for i in $(seq 1 $nr_subv); do
        fill_one_subv $mnt/subv_${i} &
        pids[$i]=$!
done

for i in $(seq 1 $nr_subv); do
        wait ${pids[$i]}
done
sync
umount $dev

---

> 
> #!/bin/sh
> 
> FILE=/mnt/test
> 
> add_dirty_bg() {
>         off="$1"
>         len="$2"
>         touch $FILE
>         xfs_io -c "falloc $off $len" $FILE
>         rm $FILE
> }
> 
> mkfs.btrfs /dev/vda
> mount /dev/vda /mnt
> 
> for ((i = 1; i < 100000; i++)); do
>         add_dirty_bg $i"G" "1G"
> done

This won't really build an extent tree layout bad enough to show the
problem.

A 1G fallocate will only create 8 128M file extents, thus 8 EXTENT_ITEMs.

Thus a leaf (16K by default) can still pack a lot of BLOCK_GROUP_ITEMs
together.

To build a case that really shows the problem, you'll need a lot of
EXTENT_ITEMs/METADATA_ITEMs to fill the gaps between BLOCK_GROUP_ITEMs.

My test script does that, but it may still not represent the real
world, as real-world usage can produce even smaller extents due to
snapshots.
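In rough numbers (a sketch using the block group and extent sizes
mentioned in this thread):

```sh
# Extent items sitting between two adjacent BLOCK_GROUP_ITEM keys:
bg_size=$((1024 * 1024 * 1024))     # 1G data block group
small_extent=$((4 * 1024 * 1024))   # 4M extents (my script)
big_extent=$((128 * 1024 * 1024))   # 128M extents (from a 1G fallocate)
echo "4M extents:   $((bg_size / small_extent)) EXTENT_ITEMs per gap"
echo "128M extents: $((bg_size / big_extent)) EXTENT_ITEMs per gap"
# 256 items per gap pushes every BLOCK_GROUP_ITEM into its own leaf;
# 8 per gap still lets one 16K leaf hold many block groups.
```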

Thanks,
Qu

> 
> umount /mnt
> 
> 
>
Johannes Thumshirn Oct. 8, 2019, 9:47 a.m. UTC | #3
On 08/10/2019 11:26, Qu Wenruo wrote:
> 
> 
> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>> [[Benchmark]]
>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>> test result.
>>>
>>> Physical device:	NVMe SSD
>>> VM device:		VirtIO block device, backup by sparse file
>>> Nodesize:		4K  (to bump up tree height)
>>> Extent data size:	4M
>>> Fs size used:		1T
>>>
>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>> (as the VM uses loopback block device backed by sparse file)
>>
>> Do you have a some additional details about the test setup? I tried to
>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>> the ALPSS Conference and I couldn't reproduce the issue.
>>
>> My testing was a 100TB sparse file passed into a VM and running this
>> script to touch all blockgroups:
> 
> Here is my test scripts:
> ---
> #!/bin/bash
> 
> dev="/dev/vdb"
> mnt="/mnt/btrfs"
> 
> nr_subv=16
> nr_extents=16384
> extent_size=$((4 * 1024 * 1024)) # 4M
> 
> _fail()
> {
>         echo "!!! FAILED: $@ !!!"
>         exit 1
> }
> 
> fill_one_subv()
> {
>         path=$1
>         if [ -z $path ]; then
>                 _fail "wrong parameter for fill_one_subv"
>         fi
>         btrfs subv create $path || _fail "create subv"
> 
>         for i in $(seq 0 $((nr_extents - 1))); do
>                 fallocate -o $((i * $extent_size)) -l $extent_size
> $path/file || _fail "fallocate"
>         done
> }
> 
> declare -a pids
> umount $mnt &> /dev/null
> umount $dev &> /dev/null
> 
> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
> mkfs.btrfs -f -n 4k $dev
> mount $dev $mnt -o nospace_cache
> 
> for i in $(seq 1 $nr_subv); do
>         fill_one_subv $mnt/subv_${i} &
>         pids[$i]=$!
> done
> 
> for i in $(seq 1 $nr_subv); do
>         wait ${pids[$i]}
> done
> sync
> umount $dev
> 
> ---
> 
>>
>> #!/bin/sh
>>
>> FILE=/mnt/test
>>
>> add_dirty_bg() {
>>         off="$1"
>>         len="$2"
>>         touch $FILE
>>         xfs_io -c "falloc $off $len" $FILE
>>         rm $FILE
>> }
>>
>> mkfs.btrfs /dev/vda
>> mount /dev/vda /mnt
>>
>> for ((i = 1; i < 100000; i++)); do
>>         add_dirty_bg $i"G" "1G"
>> done
> 
> This wont really build a good enough extent tree layout.
> 
> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
> 
> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
> together.
> 
> To build a case to really show the problem, you'll need a lot of
> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
> 
> My test scripts did that, but may still not represent the real world, as
> real world can cause even smaller extents due to snapshots.
> 

Ah, thanks for the explanation. I'll give your test script a try.
Qu Wenruo Oct. 9, 2019, 7:43 a.m. UTC | #4
On 2019/10/9 3:07 PM, Felix Niederwanger wrote:
> Hey Johannes,
> 
> glad to hear back from you :-)
> 
> As discussed I try to elaborate the setup where we experienced the
> issue, that btrfs mount takes more than 5 minutes. The initial bug
> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
> 
> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
> Installed System:            OPENSUSE LEAP 15.1
> Disks:                       / is on a separate DOM
>                              /dev/sda1 is the affected btrfs volume
> 
> Disk layout
> 
> sda      8:0    0 98.2T  0 disk 
> └─sda1   8:1    0 81.9T  0 part /ESO-RAID

How much space is used?

With enough space used (especially your tens of TB), it's pretty easy
to have so many block group items that they overload the extent tree.

There is a tool that makes the problem easier to explore:
https://github.com/adam900710/btrfs-progs/tree/account_bgs

You can compile the btrfs-corrupt-block tool, and then:
# ./btrfs-corrupt-block -X /dev/sda1

It's recommended to call it with the fs unmounted.

Then it should output something like:
extent_tree: total=1080 leaves=1025

Then please post that line to surprise us.

It shows how many unique tree blocks need to be read from disk just
for iterating the block group items.

You can consider it as how many random IOs need to be done, each of
nodesize (normally 16K).
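As a very rough illustration (my own back-of-envelope inversion of the
NVMe benchmark numbers from the cover letter, assuming per-block cost
scales linearly -- a spinning RAID6 would be far slower per read, so
take it as an order-of-magnitude sketch only):

```sh
# ~806ms to read 1080 tree blocks in the cover-letter benchmark:
us_total=805852
blocks=1080
per_block=$((us_total / blocks))             # ~746us per random tree read
echo "per-block cost: ${per_block}us"
# A 5-minute mount at that rate would mean roughly this many tree blocks:
echo "blocks implied: $((300 * 1000000 / per_block))"
```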

> sdb      8:16   0   59G  0 disk 
> ├─sdb1   8:17   0    1G  0 part /boot
> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
> └─sdb3   8:19   0 52.1G  0 part /
> 
> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
> installed NFS server.
> 
> I copied data from the old NAS (separate server, xfs volume) to the new
> btrfs volume using rsync.

If you are willing to / have enough spare space to test, you could try
my latest bg-tree feature, to see if it would solve the problem.

My pessimistic guess is that the feature would reduce the mount time to
around 1 min. My average guess is around 30s.

Thanks,
Qu

> Then I performed a system update with zypper, rebooted and run into the
> problems described in
> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
> mount worked fine (but took up to 6 minutes) and the filesystem was
> completely unresponsive. See the bug report for more details about what
> became unresponsive.
> 
> A movie of the failing boot process is still on my webserver:
> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
> 
> 
> I hope this contributes to reproduce the issue. Feel free to contact me
> if you need further details,
> 
> Greetings,
> Felix :-)
> 
> 
> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>>>> [[Benchmark]]
>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>> test result.
>>>>>
>>>>> Physical device:	NVMe SSD
>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>> Nodesize:		4K  (to bump up tree height)
>>>>> Extent data size:	4M
>>>>> Fs size used:		1T
>>>>>
>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>> (as the VM uses loopback block device backed by sparse file)
>>>> Do you have a some additional details about the test setup? I tried to
>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>
>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>> script to touch all blockgroups:
>>> Here is my test scripts:
>>> ---
>>> #!/bin/bash
>>>
>>> dev="/dev/vdb"
>>> mnt="/mnt/btrfs"
>>>
>>> nr_subv=16
>>> nr_extents=16384
>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>
>>> _fail()
>>> {
>>>         echo "!!! FAILED: $@ !!!"
>>>         exit 1
>>> }
>>>
>>> fill_one_subv()
>>> {
>>>         path=$1
>>>         if [ -z $path ]; then
>>>                 _fail "wrong parameter for fill_one_subv"
>>>         fi
>>>         btrfs subv create $path || _fail "create subv"
>>>
>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>> $path/file || _fail "fallocate"
>>>         done
>>> }
>>>
>>> declare -a pids
>>> umount $mnt &> /dev/null
>>> umount $dev &> /dev/null
>>>
>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>> mkfs.btrfs -f -n 4k $dev
>>> mount $dev $mnt -o nospace_cache
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>         fill_one_subv $mnt/subv_${i} &
>>>         pids[$i]=$!
>>> done
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>         wait ${pids[$i]}
>>> done
>>> sync
>>> umount $dev
>>>
>>> ---
>>>
>>>> #!/bin/sh
>>>>
>>>> FILE=/mnt/test
>>>>
>>>> add_dirty_bg() {
>>>>         off="$1"
>>>>         len="$2"
>>>>         touch $FILE
>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>         rm $FILE
>>>> }
>>>>
>>>> mkfs.btrfs /dev/vda
>>>> mount /dev/vda /mnt
>>>>
>>>> for ((i = 1; i < 100000; i++)); do
>>>>         add_dirty_bg $i"G" "1G"
>>>> done
>>> This wont really build a good enough extent tree layout.
>>>
>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>
>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>> together.
>>>
>>> To build a case to really show the problem, you'll need a lot of
>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>
>>> My test scripts did that, but may still not represent the real world, as
>>> real world can cause even smaller extents due to snapshots.
>>>
>> Ah thanks for the explanation. I'll give your testscript a try.
>>
>>
Felix Niederwanger Oct. 9, 2019, 8:08 a.m. UTC | #5
Hi Qu,

I'm afraid the system is now in a very different configuration and
already in production, which makes it impossible to run specific tests
on it.

At the time the problem occurred, we were using about 57/82 TB, filled
with approx. 20-30 million individual files of varying sizes. I created
a histogram of the file size distribution for a small subset:

# find . -type f -print0 | xargs -0 ls -l | \
    awk '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | sort -n

         0 7992
         1 3072
         2   4
         8 1536
       128   2
       512  22
      1024 4600
      2048 3341
      4096 5671
      8192 940
     16384 6535
     32768 700
     65536  17
    131072 3843
    262144 3362
    524288 2143
   1048576 1169
   2097152 856
   4194304 168
   8388608 5579
  16777216 4052
  33554432 604
  67108864 890

26240 out of 57098 files (45%) are <=4k in size. This should be
representative of most of the files on the affected volume.

The affected system is now in production (xfs on the same RAID6), and
as I left the university, it's unfortunately impossible to run any more
tests on that particular system.

Greetings,
Felix


On 10/9/19 9:43 AM, Qu Wenruo wrote:
>
> On 2019/10/9 3:07 PM, Felix Niederwanger wrote:
>> Hey Johannes,
>>
>> glad to hear back from you :-)
>>
>> As discussed I try to elaborate the setup where we experienced the
>> issue, that btrfs mount takes more than 5 minutes. The initial bug
>> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>
>> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
>> Installed System:            OPENSUSE LEAP 15.1
>> Disks:                       / is on a separate DOM
>>                              /dev/sda1 is the affected btrfs volume
>>
>> Disk layout
>>
>> sda      8:0    0 98.2T  0 disk 
>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
> How much space is used?
>
> With enough space used (especially your tens of TB used), it's pretty
> easy to have too many block groups items to overload the extent tree.
>
> There is a better tool to explore the problem easier:
> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>
> You can compile the btrfs-corrupt-block tool, and then:
> # ./btrfs-corrupt-block -X /dev/sda1
>
> It's recommended to call it with fs unmounted.
>
> Then it should output something like:
> extent_tree: total=1080 leaves=1025
>
> Then please post that line to surprise us.
>
> It shows how many unique tree blocks are needed to be read from disk,
> just for iterating the block group items.
>
> You could consider it as how many random IO needs to be done in nodesize
> (normally 16K).
>
>> sdb      8:16   0   59G  0 disk 
>> ├─sdb1   8:17   0    1G  0 part /boot
>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>> └─sdb3   8:19   0 52.1G  0 part /
>>
>> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
>> installed NFS server.
>>
>> I copied data from the old NAS (separate server, xfs volume) to the new
>> btrfs volume using rsync.
> If you are willing to/have enough spare space to test, you could try
> that my latest bg-tree feature, to see if it would solve the problem.
>
> My not-so-optimized guess that feature would reduce mount time to around
> 1min.
> My average guess is, around 30s.
>
> Thanks,
> Qu
>
>> Then I performed a system update with zypper, rebooted and run into the
>> problems described in
>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
>> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
>> mount worked fine (but took up to 6 minutes) and the filesystem was
>> completely unresponsive. See the bug report for more details about what
>> became unresponsive.
>>
>> A movie of the failing boot process is still on my webserver:
>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>
>>
>> I hope this contributes to reproduce the issue. Feel free to contact me
>> if you need further details,
>>
>> Greetings,
>> Felix :-)
>>
>>
>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>>>>> [[Benchmark]]
>>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>>> test result.
>>>>>>
>>>>>> Physical device:	NVMe SSD
>>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>> Extent data size:	4M
>>>>>> Fs size used:		1T
>>>>>>
>>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>>> (as the VM uses loopback block device backed by sparse file)
>>>>> Do you have a some additional details about the test setup? I tried to
>>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>>
>>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>>> script to touch all blockgroups:
>>>> Here is my test scripts:
>>>> ---
>>>> #!/bin/bash
>>>>
>>>> dev="/dev/vdb"
>>>> mnt="/mnt/btrfs"
>>>>
>>>> nr_subv=16
>>>> nr_extents=16384
>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>
>>>> _fail()
>>>> {
>>>>         echo "!!! FAILED: $@ !!!"
>>>>         exit 1
>>>> }
>>>>
>>>> fill_one_subv()
>>>> {
>>>>         path=$1
>>>>         if [ -z $path ]; then
>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>         fi
>>>>         btrfs subv create $path || _fail "create subv"
>>>>
>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>>> $path/file || _fail "fallocate"
>>>>         done
>>>> }
>>>>
>>>> declare -a pids
>>>> umount $mnt &> /dev/null
>>>> umount $dev &> /dev/null
>>>>
>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>> mkfs.btrfs -f -n 4k $dev
>>>> mount $dev $mnt -o nospace_cache
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         fill_one_subv $mnt/subv_${i} &
>>>>         pids[$i]=$!
>>>> done
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         wait ${pids[$i]}
>>>> done
>>>> sync
>>>> umount $dev
>>>>
>>>> ---
>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> FILE=/mnt/test
>>>>>
>>>>> add_dirty_bg() {
>>>>>         off="$1"
>>>>>         len="$2"
>>>>>         touch $FILE
>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>         rm $FILE
>>>>> }
>>>>>
>>>>> mkfs.btrfs /dev/vda
>>>>> mount /dev/vda /mnt
>>>>>
>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>         add_dirty_bg $i"G" "1G"
>>>>> done
>>>> This wont really build a good enough extent tree layout.
>>>>
>>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>
>>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>>> together.
>>>>
>>>> To build a case to really show the problem, you'll need a lot of
>>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>>
>>>> My test scripts did that, but may still not represent the real world, as
>>>> real world can cause even smaller extents due to snapshots.
>>>>
>>> Ah thanks for the explanation. I'll give your testscript a try.
>>>
>>>
Qu Wenruo Oct. 9, 2019, 11 a.m. UTC | #6
On 2019/10/9 4:08 PM, Felix Niederwanger wrote:
> Hi Qu,
> 
> I'm afraid the system is now in a very different configuration and
> already in production, which makes it impossible to run specific tests
> on it.
> 
> At the time the problem occurred, we were using about 57/82 TB filled
> with approx 20-30 million individual files with varying file sized. I
> created a histogram of the most common unsed filesizes for a small subset:
> 
> # find . -type f -print0 | xargs -0 ls -l | awk
> '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n",
> 2^i, size[i])}' | sort -n
> 
>          0 7992
>          1 3072
>          2   4
>          8 1536
>        128   2
>        512  22
>       1024 4600
>       2048 3341
>       4096 5671
>       8192 940
>      16384 6535
>      32768 700
>      65536  17
>     131072 3843
>     262144 3362
>     524288 2143
>    1048576 1169
>    2097152 856
>    4194304 168
>    8388608 5579
>   16777216 4052
>   33554432 604
>   67108864 890
> 
> 26240 out of 57098 files (45%) are <=4k in size. This should be
> representative for most of the files on the affected volume.

That explains the problem.

There are 3 main factors contributing to the mount time:
- Number of block groups
  This directly affects how many tree searches we need to do.

- Number of extents
  This affects how random the block group iteration will be.

- Disk IOPS performance
  Obviously, since bg iteration is mostly random IO, it's the IOPS that
  affect the overall mount time.

In your case, your fs seems to tick all the boxes.

Anyway, it still fits the assumptions I have. Although it's a pity that
we can't get a real-world benchmark, this still adds to the motivation
for the bg-tree feature.

Thanks,
Qu
> 
> The affected system is now in production (xfs on the same RAID6) and as
> I left university, it's unfortunately impossible to run any more tests
> on that particular system.
> 
> Greetings,
> Felix
> 
> 
> On 10/9/19 9:43 AM, Qu Wenruo wrote:
>>
>> On 2019/10/9 3:07 PM, Felix Niederwanger wrote:
>>> Hey Johannes,
>>>
>>> glad to hear back from you :-)
>>>
>>> As discussed I try to elaborate the setup where we experienced the
>>> issue, that btrfs mount takes more than 5 minutes. The initial bug
>>> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>>
>>> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>>>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
>>> Installed System:            OPENSUSE LEAP 15.1
>>> Disks:                       / is on a separate DOM
>>>                              /dev/sda1 is the affected btrfs volume
>>>
>>> Disk layout
>>>
>>> sda      8:0    0 98.2T  0 disk 
>>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
>> How much space is used?
>>
>> With enough space used (especially your tens of TB used), it's pretty
>> easy to have too many block groups items to overload the extent tree.
>>
>> There is a better tool to explore the problem easier:
>> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>>
>> You can compile the btrfs-corrupt-block tool, and then:
>> # ./btrfs-corrupt-block -X /dev/sda1
>>
>> It's recommended to call it with fs unmounted.
>>
>> Then it should output something like:
>> extent_tree: total=1080 leaves=1025
>>
>> Then please post that line to surprise us.
>>
>> It shows how many unique tree blocks are needed to be read from disk,
>> just for iterating the block group items.
>>
>> You could consider it as how many random IO needs to be done in nodesize
>> (normally 16K).
>>
>>> sdb      8:16   0   59G  0 disk 
>>> ├─sdb1   8:17   0    1G  0 part /boot
>>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>>> └─sdb3   8:19   0 52.1G  0 part /
>>>
>>> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
>>> installed NFS server.
>>>
>>> I copied data from the old NAS (separate server, xfs volume) to the new
>>> btrfs volume using rsync.
>> If you are willing to/have enough spare space to test, you could try
>> that my latest bg-tree feature, to see if it would solve the problem.
>>
>> My not-so-optimized guess that feature would reduce mount time to around
>> 1min.
>> My average guess is, around 30s.
>>
>> Thanks,
>> Qu
>>
>>> Then I performed a system update with zypper, rebooted and run into the
>>> problems described in
>>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: Boot
>>> failed, because mounting /ESO-RAID run into a 5 minutes timeout. Manual
>>> mount worked fine (but took up to 6 minutes) and the filesystem was
>>> completely unresponsive. See the bug report for more details about what
>>> became unresponsive.
>>>
>>> A movie of the failing boot process is still on my webserver:
>>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>>
>>>
>>> I hope this contributes to reproduce the issue. Feel free to contact me
>>> if you need further details,
>>>
>>> Greetings,
>>> Felix :-)
>>>
>>>
>>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>>> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>>>>>> [[Benchmark]]
>>>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>>>> test result.
>>>>>>>
>>>>>>> Physical device:	NVMe SSD
>>>>>>> VM device:		VirtIO block device, backup by sparse file
>>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>>> Extent data size:	4M
>>>>>>> Fs size used:		1T
>>>>>>>
>>>>>>> All file extents on disk is in 4M size, preallocated to reduce space usage
>>>>>>> (as the VM uses loopback block device backed by sparse file)
>>>>>> Do you have a some additional details about the test setup? I tried to
>>>>>> do the same (testing) for a bug Felix (added to Cc) reported to my at
>>>>>> the ALPSS Conference and I couldn't reproduce the issue.
>>>>>>
>>>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>>>> script to touch all blockgroups:
>>>>> Here is my test scripts:
>>>>> ---
>>>>> #!/bin/bash
>>>>>
>>>>> dev="/dev/vdb"
>>>>> mnt="/mnt/btrfs"
>>>>>
>>>>> nr_subv=16
>>>>> nr_extents=16384
>>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>>
>>>>> _fail()
>>>>> {
>>>>>         echo "!!! FAILED: $@ !!!"
>>>>>         exit 1
>>>>> }
>>>>>
>>>>> fill_one_subv()
>>>>> {
>>>>>         path=$1
>>>>>         if [ -z $path ]; then
>>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>>         fi
>>>>>         btrfs subv create $path || _fail "create subv"
>>>>>
>>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size
>>>>> $path/file || _fail "fallocate"
>>>>>         done
>>>>> }
>>>>>
>>>>> declare -a pids
>>>>> umount $mnt &> /dev/null
>>>>> umount $dev &> /dev/null
>>>>>
>>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>>> mkfs.btrfs -f -n 4k $dev
>>>>> mount $dev $mnt -o nospace_cache
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         fill_one_subv $mnt/subv_${i} &
>>>>>         pids[$i]=$!
>>>>> done
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         wait ${pids[$i]}
>>>>> done
>>>>> sync
>>>>> umount $dev
>>>>>
>>>>> ---
>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> FILE=/mnt/test
>>>>>>
>>>>>> add_dirty_bg() {
>>>>>>         off="$1"
>>>>>>         len="$2"
>>>>>>         touch $FILE
>>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>>         rm $FILE
>>>>>> }
>>>>>>
>>>>>> mkfs.btrfs /dev/vda
>>>>>> mount /dev/vda /mnt
>>>>>>
>>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>>         add_dirty_bg $i"G" "1G"
>>>>>> done
>>>>> This wont really build a good enough extent tree layout.
>>>>>
>>>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>>
>>>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUPS all
>>>>> together.
>>>>>
>>>>> To build a case to really show the problem, you'll need a lot of
>>>>> EXTENT_ITEM/METADATA_ITEMS to fill the gaps between BLOCK_GROUPS.
>>>>>
>>>>> My test scripts did that, but may still not represent the real world, as
>>>>> real world can cause even smaller extents due to snapshots.
>>>>>
>>>> Ah thanks for the explanation. I'll give your testscript a try.
>>>>
>>>>
>