Message ID | 6fbdddf38bd353dba7eba2117573f3b74fb79e40.1605697030.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: unlock path before checking if extent is shared during nocow writeback | expand |
On Wed, Nov 18, 2020 at 11:00:17AM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we are attempting to start writeback for an existing extent in NOCOW > mode, at run_delalloc_nocow(), we must check if the extent is shared, and > if it is, fallback to a COW write. However we do such check while still > holding a read lock on the leaf that contains the file extent item, and > that check, the call to btrfs_cross_ref_exist(), can take some time > because: > > 1) It needs to do a search on the extent tree, which obviously takes some > time, specially if delayed references are being run at the moment, as > we can block when trying to lock currently write locked btree nodes; > > 2) It needs to check the delayed references for any existing reference > for our data extent, this requires acquiring the delayed references' > spinlock and maybe block on the mutex of a delayed reference head in the > case where there is a delayed reference for our data extent, in the > worst case it makes us release the path on the extent tree and retry > the whole process again (going back to step 1). > > There are other operations we do while holding the leaf locked that can > take some significant time as well (specially all together): > > * btrfs_extent_readonly() - to check if the block group containing the > extent is currently in RO mode. This requires taking a spinlock and > searching for the block group in a rbtree that can be big on large > filesystems; > > * csum_exist_in_range() - to search if there are any checksums in the > csum tree for the extent. Like before, this can take some time if we are > in a filesystem that has both COW and NOCOW files, in which case the > csum tree is not empty; > > * btrfs_inc_nocow_writers() - increment the number of nocow writers in the > block group that contains the data extent. Needs to acquire a spinlock > and search for the block group in a rbtree that can be big on large > filesystems. > > So just unlock the leaf (release the path) before doing all those checks, > since we do not need it anymore. In case we can not do a NOCOW write for > the extent, due to any of those checks failing, and the writeback range > goes beyond that extents' length, we will do another btree search for the > next file extent item. > > The following script that calls dbench was used to measure the impact of > this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device > directly (no intermediary filesystem on the host) and using a non-debug > kernel (default configuration on Debian): > > $ cat test-dbench.sh > #!/bin/bash > > DEV=/dev/sdk > MNT=/mnt/sdk > MOUNT_OPTIONS="-o ssd -o nodatacow" > MKFS_OPTIONS="-m single -d single" > > mkfs.btrfs -f $MKFS_OPTIONS $DEV > mount $MOUNT_OPTIONS $DEV $MNT > > dbench -D $MNT -t 300 64 > > umount $MNT > > Before this change: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 9326331 0.317 399.957 > Close 6851198 0.002 6.402 > Rename 394894 2.621 402.819 > Unlink 1883131 0.931 398.082 > Deltree 256 19.160 303.580 > Mkdir 128 0.003 0.016 > Qpathinfo 8452314 0.068 116.133 > Qfileinfo 1481921 0.001 5.081 > Qfsinfo 1549963 0.002 4.444 > Sfileinfo 759679 0.084 17.079 > Find 3268168 0.396 118.196 > WriteX 4653310 0.056 110.993 > ReadX 14618818 0.005 23.314 > LockX 30364 0.003 0.497 > UnlockX 30364 0.002 1.720 > Flush 653619 16.954 569.299 > > Throughput 966.651 MB/sec 64 clients 64 procs max_latency=569.377 ms > > After this change: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 9710433 0.302 232.449 > Close 7132948 0.002 11.496 > Rename 411144 2.452 131.805 > Unlink 1960961 0.893 230.383 > Deltree 256 14.858 198.646 > Mkdir 128 0.002 0.005 > Qpathinfo 8800890 0.066 111.588 > Qfileinfo 1542556 0.001 3.852 > Qfsinfo 1613835 0.002 5.483 > Sfileinfo 790871 0.081 19.492 > Find 3402743 0.386 120.185 > WriteX 4842918 0.054 179.312 > ReadX 15220407 0.005 32.435 > LockX 31612 0.003 1.533 > UnlockX 31612 0.002 1.047 > Flush 680567 16.320 463.323 > > Throughput 1016.59 MB/sec 64 clients 64 procs max_latency=463.327 ms > > +5.0% throughput, -20.5% max latency > > Also, the following test using fio was run: > > $ cat test-fio.sh > #!/bin/bash > > DEV=/dev/sdk > MNT=/mnt/sdk > MOUNT_OPTIONS="-o ssd -o nodatacow" > MKFS_OPTIONS="-d single -m single" > > if [ $# -ne 4 ]; then > echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE" > exit 1 > fi > > NUM_JOBS=$1 > FILE_SIZE=$2 > FSYNC_FREQ=$3 > BLOCK_SIZE=$4 > > cat <<EOF > /tmp/fio-job.ini > [writers] > rw=randwrite > fsync=$FSYNC_FREQ > fallocate=none > group_reporting=1 > direct=0 > bs=$BLOCK_SIZE > ioengine=sync > size=$FILE_SIZE > directory=$MNT > numjobs=$NUM_JOBS > EOF > > echo > echo "Using fio config:" > echo > cat /tmp/fio-job.ini > echo > echo "mount options: $MOUNT_OPTIONS" > echo > > mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null > mount $MOUNT_OPTIONS $DEV $MNT > > echo "Creating nodatacow files before fio runs..." > for ((i = 0; i < $NUM_JOBS; i++)); do > xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0" > done > sync > > fio /tmp/fio-job.ini > umount $MNT > > Before this change: > > $ ./test-fio.sh 16 512M 2 4K > (...) > WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec > > After this change: > > $ ./test-fio.sh 16 512M 2 4K > (...) > WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec > > +9.7% throughput, -9.8% runtime > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Nice, thank you. Added to misc-next.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0a2ee8983528..24c8ad7e0603 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1649,6 +1649,15 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, goto out_check; if (extent_type == BTRFS_FILE_EXTENT_REG && !force) goto out_check; + + /* + * The following checks can be expensive, as they need to + * take other locks and do btree or rbtree searches, so + * release the path to avoid blocking other tasks for too + * long. + */ + btrfs_release_path(path); + /* If extent is RO, we must COW it */ if (btrfs_extent_readonly(fs_info, disk_bytenr)) goto out_check; @@ -1724,12 +1733,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, cur_offset = extent_end; if (cur_offset > end) break; + if (!path->nodes[0]) + continue; path->slots[0]++; goto next_slot; } - btrfs_release_path(path); - /* * COW range from cow_start to found_key.offset - 1. As the key * will contain the beginning of the first extent that can be