[v3,08/10] btrfs: add a shrinker for extent maps

From: Filipe Manana <fdmanana@suse.com>

Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written, in which case the
respective file extent items are created when the ordered extent completes.

We currently don't have any limit on how many extent maps we can have,
neither per inode nor globally. Most of the time this is not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback, or if the inode's i_size is less than
   or equal to 16M (see try_release_extent_mapping()). This 16M i_size
   constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
   extent_io and extent_state optimizations"), but there's no explanation
   of why we have it or why the value is 16M.
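
The conditions in point 2 can be summarized with a small userspace sketch.
This is not the kernel code: the struct, the function name and the macro
below are hypothetical and only restate the rules described above (dirty,
under writeback, or i_size not larger than 16M all prevent the release).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: the 16M i_size limit described in the text above. */
#define SKETCH_SIZE_LIMIT (16ULL * 1024 * 1024)

/* Hypothetical, simplified view of the state the release path looks at. */
struct sketch_folio_state {
	bool dirty;      /* folio has dirty data */
	bool writeback;  /* folio is under writeback */
	uint64_t i_size; /* size of the inode owning the folio */
};

/* Returns true when extent maps in the folio range may be released. */
static bool sketch_can_release_extent_maps(const struct sketch_folio_state *s)
{
	if (s->dirty || s->writeback)
		return false;
	/* Files with a size up to and including 16M keep their extent maps. */
	if (s->i_size <= SKETCH_SIZE_LIMIT)
		return false;
	return true;
}
```

With these rules a clean folio of a file that is exactly 16M still keeps
its extent maps, which is why many small files can accumulate them.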

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, especially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, many non mergeable
   extent maps (due to random 4K writes for example) and a machine with
   very little memory;

2) There's a set of tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there are always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed to by the Link
tag below, and it requires a very large file under heavy IO and a machine
with very little RAM, which is probably unlikely to happen in a real world
use case.

However when using direct IO this can happen much more easily, because the
page cache is not used, and therefore btrfs_release_folio() is never
called. This means extent maps are dropped only when evicting the inode,
so if we have tasks that keep a file descriptor open and keep doing IO on
a very large file (or files), we can exhaust memory due to an unbounded
number of extent maps. This is especially likely if we have a huge file
with millions of small extents and their extent maps are not mergeable
(non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with little memory it's easy and quick to get
into an OOM situation, as reported in that thread.

So to avoid this issue add a shrinker that removes extent maps, as long
as they are not pinned, and that takes proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is triggered through the callbacks
nr_cached_objects and free_cached_objects of struct super_operations.
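
As a rough illustration of the "as long as they are not pinned" rule, here
is a minimal userspace sketch of a bounded scan. It is not the code from
this patch: struct sketch_em and sketch_scan() are hypothetical names, a
plain array stands in for the extent map tree, and the fsync handling
mentioned above is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent map; not the kernel's struct. */
struct sketch_em {
	bool pinned;  /* still needed, must not be dropped */
	bool present; /* still tracked, i.e. not dropped yet */
};

/*
 * Look at up to nr_to_scan tracked entries and drop the ones that are not
 * pinned. Returns the number of entries dropped.
 */
static long sketch_scan(struct sketch_em *ems, size_t count, long nr_to_scan)
{
	long scanned = 0;
	long dropped = 0;

	for (size_t i = 0; i < count && scanned < nr_to_scan; i++) {
		if (!ems[i].present)
			continue;
		scanned++;
		/* Pinned entries are skipped, never dropped. */
		if (ems[i].pinned)
			continue;
		ems[i].present = false;
		dropped++;
	}
	return dropped;
}
```

The scan budget plays the role of the amount of work requested from a
free_cached_objects style callback, and the return value is what such a
callback would report back as freed.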

The shrinker iterates over all roots and over all inodes of each root,
and keeps track of the last scanned root and inode, so that the next time
it runs, it starts from that root and from the next inode. This is similar
to what xfs does for its inode reclaim (it implements those callbacks, and
cycles through inodes by starting from where it ended last time).
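
The resume-from-where-we-stopped iteration can be sketched in userspace as
follows. All names here (struct sketch_cursor, sketch_walk()) are
hypothetical and roots are reduced to a per-root inode count, so this only
illustrates the cursor idea, not the actual implementation in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor remembering where the previous run stopped. */
struct sketch_cursor {
	size_t root;  /* index of the root to resume from */
	size_t inode; /* index of the next inode within that root */
};

/*
 * Visit up to budget inodes, resuming from the cursor and wrapping around
 * to the first root after the last one. inodes_per_root[r] is the number of
 * inodes in root r. Visited (root, inode) pairs are stored in out, which
 * the caller sizes for at least budget entries. A single call moves past at
 * most nr_roots roots, so it also terminates when every root is empty.
 * Returns the number of inodes visited.
 */
static size_t sketch_walk(struct sketch_cursor *cur,
			  const size_t *inodes_per_root, size_t nr_roots,
			  size_t budget, size_t (*out)[2])
{
	size_t visited = 0;
	size_t roots_done = 0;

	if (nr_roots == 0)
		return 0;
	/* A stale cursor restarts from the first root. */
	if (cur->root >= nr_roots) {
		cur->root = 0;
		cur->inode = 0;
	}

	while (visited < budget && roots_done < nr_roots) {
		if (cur->inode >= inodes_per_root[cur->root]) {
			/* Done with this root, move to the next one. */
			cur->root = (cur->root + 1) % nr_roots;
			cur->inode = 0;
			roots_done++;
			continue;
		}
		out[visited][0] = cur->root;
		out[visited][1] = cur->inode;
		visited++;
		cur->inode++;
	}
	return visited;
}
```

Keeping the cursor across calls spreads the scanning work over all roots
and inodes instead of repeatedly hitting the first ones.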

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c | 159 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_map.h |   1 +
 fs/btrfs/fs.h         |   2 +
 fs/btrfs/super.c      |  17 +++++
 4 files changed, 179 insertions(+)

Message ID	4bfde54904b5a91a71eb0d86b9c78367865f93d8.1713267925.git.fdmanana@suse.com (mailing list archive)
State	New
Headers	From: fdmanana@kernel.org
   To: linux-btrfs@vger.kernel.org
   Subject: [PATCH v3 08/10] btrfs: add a shrinker for extent maps
   Date: Tue, 16 Apr 2024 14:08:10 +0100
   Message-Id: <4bfde54904b5a91a71eb0d86b9c78367865f93d8.1713267925.git.fdmanana@suse.com>
   In-Reply-To: <cover.1713267925.git.fdmanana@suse.com>
   References: <cover.1713267925.git.fdmanana@suse.com>
Series	btrfs: add a shrinker for extent maps
   [v3,00/10] btrfs: add a shrinker for extent maps
   [v3,01/10] btrfs: pass the extent map tree's inode to add_extent_mapping()
   [v3,02/10] btrfs: pass the extent map tree's inode to clear_em_logging()
   [v3,03/10] btrfs: pass the extent map tree's inode to remove_extent_mapping()
   [v3,04/10] btrfs: pass the extent map tree's inode to replace_extent_mapping()
   [v3,05/10] btrfs: pass the extent map tree's inode to setup_extent_mapping()
   [v3,06/10] btrfs: pass the extent map tree's inode to try_merge_map()
   [v3,07/10] btrfs: add a global per cpu counter to track number of used extent maps
   [v3,08/10] btrfs: add a shrinker for extent maps
   [v3,09/10] btrfs: update comment for btrfs_set_inode_full_sync() about locking
   [v3,10/10] btrfs: add tracepoints for extent map shrinker events