Message ID | 5d1743b20f84e0262a2c229cd5e877ed0f0596a0.1712748143.git.fdmanana@suse.com
---|---
State | New
Series | btrfs: add a shrinker for extent maps
On Thu, Apr 11, 2024 at 6:58 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2024/4/10 20:58, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Extent maps are used either to represent existing file extent items, or to
> > represent new extents that are going to be written and the respective file
> > extent items are created when the ordered extent completes.
> >
> > We currently don't have any limit for how many extent maps we can have,
> > neither per inode nor globally. Most of the time this is not too noticeable
> > because extent maps are removed in the following situations:
> >
> > 1) When evicting an inode;
> >
> > 2) When releasing folios (pages) through the btrfs_release_folio() address
> >    space operation callback.
> >
> > However we won't release extent maps in the folio range if the folio is
> > either dirty or under writeback or if the inode's i_size is less than
> > or equal to 16M (see try_release_extent_mapping()). This 16M i_size
> > constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> > extent_io and extent_state optimizations"), but there's no explanation
> > about why we have it or why the 16M value.
> >
> > This means that for buffered IO we can reach an OOM situation due to too
> > many extent maps if either of the following happens:
> >
> > 1) There's a set of tasks constantly doing IO on many files with a size
> >    not larger than 16M, especially if they keep the files open for very
> >    long periods, therefore preventing inode eviction.
> >
> >    This requires a really high number of such files, and having many non
> >    mergeable extent maps (due to random 4K writes for example) and a
> >    machine with very little memory;
> >
> > 2) There's a set of tasks constantly doing random write IO (therefore
> >    creating many non mergeable extent maps) on files and keeping them
> >    open for long periods of time, so inode eviction doesn't happen and
> >    there's always a lot of dirty pages or pages under writeback,
> >    preventing btrfs_release_folio() from releasing the respective extent
> >    maps.
> >
> > This second case was actually reported in the thread pointed to by the
> > Link tag below, and it requires a very large file under heavy IO and a
> > machine with a very small amount of RAM, which is probably hard to hit
> > in practice in a real world use case.
> >
> > However when using direct IO this is not so hard to trigger, because the
> > page cache is not used, and therefore btrfs_release_folio() is never
> > called. Which means extent maps are dropped only when evicting the inode,
> > and that means that if we have tasks that keep a file descriptor open and
> > keep doing IO on a very large file (or files), we can exhaust memory due
> > to an unbounded amount of extent maps. This is especially easy to happen
> > if we have a huge file with millions of small extents and their extent
> > maps are not mergeable (non contiguous offsets and disk locations).
> >
> > This was reported in that thread with the following fio test:
> >
> >    $ cat test.sh
> >    #!/bin/bash
> >
> >    DEV=/dev/sdj
> >    MNT=/mnt/sdj
> >    MOUNT_OPTIONS="-o ssd"
> >    MKFS_OPTIONS=""
> >
> >    cat <<EOF > /tmp/fio-job.ini
> >    [global]
> >    name=fio-rand-write
> >    filename=$MNT/fio-rand-write
> >    rw=randwrite
> >    bs=4K
> >    direct=1
> >    numjobs=16
> >    fallocate=none
> >    time_based
> >    runtime=90000
> >
> >    [file1]
> >    size=300G
> >    ioengine=libaio
> >    iodepth=16
> >
> >    EOF
> >
> >    umount $MNT &> /dev/null
> >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> >    mount $MOUNT_OPTIONS $DEV $MNT
> >
> >    fio /tmp/fio-job.ini
> >    umount $MNT
> >
> > Monitoring the btrfs_extent_map slab while running the test with:
> >
> >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> >
> > shows the number of active and total extent maps skyrocketing to tens of
> > millions, and on systems with a small amount of memory it's easy and
> > quick to get into an OOM situation, as reported in that thread.
> >
> > So to avoid this issue add a shrinker that will remove extent maps, as
> > long as they are not pinned, and takes proper care with any concurrent
> > fsync to avoid missing extents (setting the full sync flag while in the
> > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > for its extent_status structure, which is analogous to btrfs' extent_map
> > structure.
> >
> > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> [...]
> > +
> > +static unsigned long btrfs_scan_root(struct btrfs_root *root,
> > +                                     unsigned long *scanned,
> > +                                     unsigned long nr_to_scan)
> > +{
> > +       unsigned long nr_dropped = 0;
> > +       u64 ino = 0;
> > +
> > +       while (*scanned < nr_to_scan) {
> > +               struct rb_node *node;
> > +               struct rb_node *prev = NULL;
> > +               struct btrfs_inode *inode;
> > +               bool stop_search = true;
> > +
> > +               spin_lock(&root->inode_lock);
> > +               node = root->inode_tree.rb_node;
> > +
> > +               while (node) {
> > +                       prev = node;
> > +                       inode = rb_entry(node, struct btrfs_inode, rb_node);
> > +                       if (ino < btrfs_ino(inode))
> > +                               node = node->rb_left;
> > +                       else if (ino > btrfs_ino(inode))
> > +                               node = node->rb_right;
> > +                       else
> > +                               break;
> > +               }
> > +
> > +               if (!node) {
> > +                       while (prev) {
> > +                               inode = rb_entry(prev, struct btrfs_inode, rb_node);
> > +                               if (ino <= btrfs_ino(inode)) {
> > +                                       node = prev;
> > +                                       break;
> > +                               }
> > +                               prev = rb_next(prev);
> > +                       }
> > +               }
>
> The "while (node) {}" loop and the "if (!node) {}" block above are there
> to locate the first inode after @ino (which is the last scanned inode
> number).
>
> Maybe extract them into a helper with some name like
> "find_next_inode_to_scan()"? That could be a little easier to read.

Sure, I can do that.
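For illustration, a minimal sketch of what that suggested helper could look like, with the body lifted from the two loops quoted above. The name find_next_inode_to_scan() is only Qu's suggestion, and the expectation that the caller holds root->inode_lock is an assumption based on how btrfs_scan_root() walks the tree:

/*
 * Hypothetical helper per the review suggestion: return the inode with the
 * smallest inode number >= min_ino, or NULL if there is none. The caller is
 * assumed to hold root->inode_lock, as btrfs_scan_root() does for this search.
 */
static struct btrfs_inode *find_next_inode_to_scan(struct btrfs_root *root,
                                                   u64 min_ino)
{
        struct rb_node *node = root->inode_tree.rb_node;
        struct rb_node *prev = NULL;
        struct btrfs_inode *inode;

        /* Binary search for an inode numbered exactly min_ino. */
        while (node) {
                prev = node;
                inode = rb_entry(node, struct btrfs_inode, rb_node);
                if (min_ino < btrfs_ino(inode))
                        node = node->rb_left;
                else if (min_ino > btrfs_ino(inode))
                        node = node->rb_right;
                else
                        return inode;
        }

        /* No exact match: walk forward from where the search stopped. */
        while (prev) {
                inode = rb_entry(prev, struct btrfs_inode, rb_node);
                if (min_ino <= btrfs_ino(inode))
                        return inode;
                prev = rb_next(prev);
        }

        return NULL;
}

With such a helper, the caller's loop body would be left with just the igrab(), btrfs_scan_inode() and iput() steps plus the locking around them.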
>
> > +
> > +               while (node) {
> > +                       inode = rb_entry(node, struct btrfs_inode, rb_node);
> > +                       ino = btrfs_ino(inode) + 1;
> > +                       if (igrab(&inode->vfs_inode)) {
> > +                               spin_unlock(&root->inode_lock);
> > +                               stop_search = false;
> > +
> > +                               nr_dropped += btrfs_scan_inode(inode, scanned,
> > +                                                              nr_to_scan);
> > +                               iput(&inode->vfs_inode);
> > +                               cond_resched();
> > +                               break;
> > +                       }
> > +                       node = rb_next(node);
> > +               }
> > +
> > +               if (stop_search) {
> > +                       spin_unlock(&root->inode_lock);
> > +                       break;
> > +               }
> > +       }
> > +
> > +       return nr_dropped;
> > +}
> > +
> > +static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
> > +                                           struct shrink_control *sc)
> > +{
> > +       struct btrfs_fs_info *fs_info = shrinker->private_data;
> > +       unsigned long nr_dropped = 0;
> > +       unsigned long scanned = 0;
> > +       u64 next_root_id = 0;
> > +
> > +       while (scanned < sc->nr_to_scan) {
> > +               struct btrfs_root *root;
> > +               unsigned long count;
> > +
> > +               spin_lock(&fs_info->fs_roots_radix_lock);
> > +               count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
> > +                                              (void **)&root, next_root_id, 1);
> > +               if (count == 0) {
> > +                       spin_unlock(&fs_info->fs_roots_radix_lock);
> > +                       break;
> > +               }
> > +               next_root_id = btrfs_root_id(root) + 1;
> > +               root = btrfs_grab_root(root);
> > +               spin_unlock(&fs_info->fs_roots_radix_lock);
> > +
> > +               if (!root)
> > +                       continue;
> > +
> > +               if (is_fstree(btrfs_root_id(root)))
> > +                       nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
> > +
> > +               btrfs_put_root(root);
> > +       }
> > +
> > +       return nr_dropped;
> > +}
> > +
> > +static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
> > +                                            struct shrink_control *sc)
> > +{
> > +       struct btrfs_fs_info *fs_info = shrinker->private_data;
> > +       const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
> > +
> > +       /* The unsigned long type is 32 bits on 32 bits platforms. */
> > +#if BITS_PER_LONG == 32
> > +       if (total > ULONG_MAX)
> > +               return ULONG_MAX;
> > +#endif
>
> Can this be a simple min_t(unsigned long, total, ULONG_MAX)?

Why? Either the current if or min should be easy to understand for anyone.

I'm actually removing this altogether, as on 32-bit systems we can't have
more than ULONG_MAX extent maps allocated anyway, even if they could be
1 byte long. Ext4 ignores that in its shrinker for its extent_status
structure for example, which makes sense for that same reason.

>
> Another question is, since total is s64, wouldn't any negative number go
> to ULONG_MAX directly for 32-bit systems?

How could you have a negative number? percpu_counter_sum_positive()
guarantees a non-negative number is returned.

>
> And since the function is just a shrink hook, I'm not sure what would
> happen if we return ULONG_MAX for negative values.

Nothing. When scanning we stop iteration once we don't find more
roots/inodes/extent maps.

>
> Otherwise the idea looks pretty good, it's just me not qualified to give
> a good review.
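A side note on the min_t() suggestion: written literally as min_t(unsigned long, total, ULONG_MAX), the macro casts both arguments to unsigned long before comparing, so on a 32-bit kernel the s64 would be truncated to its low 32 bits rather than clamped. A correct clamp would have to compare in the wider type first; a sketch of that form, moot in the end since the check is dropped altogether:

#if BITS_PER_LONG == 32
        /*
         * Compare as s64 first; casting total straight to unsigned long
         * would truncate values above 2^32 - 1 instead of clamping them.
         * ULONG_MAX fits in s64 here because unsigned long is 32 bits.
         */
        return (unsigned long)min_t(s64, total, ULONG_MAX);
#endif
        return total;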
>
> Thanks,
> Qu
> > +       return total;
> > +}
> > +
> > +int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
> > +{
> > +       int ret;
> > +
> > +       ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
> > +       if (ret)
> > +               return ret;
> > +
> > +       fs_info->extent_map_shrinker = shrinker_alloc(0, "em-btrfs:%s", fs_info->sb->s_id);
> > +       if (!fs_info->extent_map_shrinker) {
> > +               percpu_counter_destroy(&fs_info->evictable_extent_maps);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       fs_info->extent_map_shrinker->scan_objects = btrfs_extent_maps_scan;
> > +       fs_info->extent_map_shrinker->count_objects = btrfs_extent_maps_count;
> > +       fs_info->extent_map_shrinker->private_data = fs_info;
> > +
> > +       shrinker_register(fs_info->extent_map_shrinker);
> > +
> > +       return 0;
> > +}
> > +
> > +void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info)
> > +{
> > +       shrinker_free(fs_info->extent_map_shrinker);
> > +       ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
> > +       percpu_counter_destroy(&fs_info->evictable_extent_maps);
> > +}
> > diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> > index c3707461ff62..8a6be2f7a0e2 100644
> > --- a/fs/btrfs/extent_map.h
> > +++ b/fs/btrfs/extent_map.h
> > @@ -140,5 +140,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
> >  int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
> >                                     struct extent_map *new_em,
> >                                     bool modified);
> > +int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info);
> > +void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info);
> >
> >  #endif
> > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> > index 534d30dafe32..f1414814bd69 100644
> > --- a/fs/btrfs/fs.h
> > +++ b/fs/btrfs/fs.h
> > @@ -857,6 +857,8 @@ struct btrfs_fs_info {
> >         struct lockdep_map btrfs_trans_pending_ordered_map;
> >         struct lockdep_map btrfs_ordered_extent_map;
> >
> > +       struct shrinker *extent_map_shrinker;
> > +
> >  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> >         spinlock_t ref_verify_lock;
> >         struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c2d35b2062e..8bb295eaf3d7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1266,11 +1266,10 @@ static void free_global_roots(struct btrfs_fs_info *fs_info)

 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+       btrfs_unregister_extent_map_shrinker(fs_info);
        percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
        percpu_counter_destroy(&fs_info->delalloc_bytes);
        percpu_counter_destroy(&fs_info->ordered_bytes);
-       ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
-       percpu_counter_destroy(&fs_info->evictable_extent_maps);
        percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
        btrfs_free_csum_hash(fs_info);
        btrfs_free_stripe_hash_table(fs_info);
@@ -2846,11 +2845,11 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
        sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
        sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);

-       ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
+       ret = btrfs_register_extent_map_shrinker(fs_info);
        if (ret)
                return ret;

-       ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+       ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
        if (ret)
                return ret;

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2fcf28148a81..fa755921442d 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -8,6 +8,7 @@
 #include "extent_map.h"
 #include "compression.h"
 #include "btrfs_inode.h"
+#include "disk-io.h"

 static struct kmem_cache *extent_map_cache;

@@ -1026,3 +1027,202 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
        free_extent_map(split_pre);
        return ret;
 }
+
+static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
+                                     unsigned long *scanned,
+                                     unsigned long nr_to_scan)
+{
+       struct extent_map_tree *tree = &inode->extent_tree;
+       unsigned long nr_dropped = 0;
+       struct rb_node *node;
+
+       /*
+        * Take the mmap lock so that we serialize with the inode logging phase
+        * of fsync because we may need to set the full sync flag on the inode,
+        * in case we have to remove extent maps in the tree's list of modified
+        * extents. If we set the full sync flag in the inode while an fsync is
+        * in progress, we may risk missing new extents because before the flag
+        * is set, fsync decides to only wait for writeback to complete and then
+        * during inode logging it sees the flag set and uses the subvolume tree
+        * to find new extents, which may not be there yet because ordered
+        * extents haven't completed yet.
+        */
+       down_read(&inode->i_mmap_lock);
+       write_lock(&tree->lock);
+       node = rb_first_cached(&tree->map);
+       while (node) {
+               struct extent_map *em;
+
+               em = rb_entry(node, struct extent_map, rb_node);
+               node = rb_next(node);
+               (*scanned)++;
+
+               if (em->flags & EXTENT_FLAG_PINNED)
+                       goto next;
+
+               if (!list_empty(&em->list))
+                       btrfs_set_inode_full_sync(inode);
+
+               remove_extent_mapping(inode, em);
+               /* Drop the reference for the tree. */
+               free_extent_map(em);
+               nr_dropped++;
+next:
+               if (*scanned >= nr_to_scan)
+                       break;
+
+               /*
+                * Restart if we had to resched, and any extent maps that were
+                * pinned before may have become unpinned after we released the
+                * lock and took it again.
+                */
+               if (cond_resched_rwlock_write(&tree->lock))
+                       node = rb_first_cached(&tree->map);
+       }
+       write_unlock(&tree->lock);
+       up_read(&inode->i_mmap_lock);
+
+       return nr_dropped;
+}
+
+static unsigned long btrfs_scan_root(struct btrfs_root *root,
+                                    unsigned long *scanned,
+                                    unsigned long nr_to_scan)
+{
+       unsigned long nr_dropped = 0;
+       u64 ino = 0;
+
+       while (*scanned < nr_to_scan) {
+               struct rb_node *node;
+               struct rb_node *prev = NULL;
+               struct btrfs_inode *inode;
+               bool stop_search = true;
+
+               spin_lock(&root->inode_lock);
+               node = root->inode_tree.rb_node;
+
+               while (node) {
+                       prev = node;
+                       inode = rb_entry(node, struct btrfs_inode, rb_node);
+                       if (ino < btrfs_ino(inode))
+                               node = node->rb_left;
+                       else if (ino > btrfs_ino(inode))
+                               node = node->rb_right;
+                       else
+                               break;
+               }
+
+               if (!node) {
+                       while (prev) {
+                               inode = rb_entry(prev, struct btrfs_inode, rb_node);
+                               if (ino <= btrfs_ino(inode)) {
+                                       node = prev;
+                                       break;
+                               }
+                               prev = rb_next(prev);
+                       }
+               }
+
+               while (node) {
+                       inode = rb_entry(node, struct btrfs_inode, rb_node);
+                       ino = btrfs_ino(inode) + 1;
+                       if (igrab(&inode->vfs_inode)) {
+                               spin_unlock(&root->inode_lock);
+                               stop_search = false;
+
+                               nr_dropped += btrfs_scan_inode(inode, scanned,
+                                                              nr_to_scan);
+                               iput(&inode->vfs_inode);
+                               cond_resched();
+                               break;
+                       }
+                       node = rb_next(node);
+               }
+
+               if (stop_search) {
+                       spin_unlock(&root->inode_lock);
+                       break;
+               }
+       }
+
+       return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
+                                           struct shrink_control *sc)
+{
+       struct btrfs_fs_info *fs_info = shrinker->private_data;
+       unsigned long nr_dropped = 0;
+       unsigned long scanned = 0;
+       u64 next_root_id = 0;
+
+       while (scanned < sc->nr_to_scan) {
+               struct btrfs_root *root;
+               unsigned long count;
+
+               spin_lock(&fs_info->fs_roots_radix_lock);
+               count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+                                              (void **)&root, next_root_id, 1);
+               if (count == 0) {
+                       spin_unlock(&fs_info->fs_roots_radix_lock);
+                       break;
+               }
+               next_root_id = btrfs_root_id(root) + 1;
+               root = btrfs_grab_root(root);
+               spin_unlock(&fs_info->fs_roots_radix_lock);
+
+               if (!root)
+                       continue;
+
+               if (is_fstree(btrfs_root_id(root)))
+                       nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
+
+               btrfs_put_root(root);
+       }
+
+       return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
+                                            struct shrink_control *sc)
+{
+       struct btrfs_fs_info *fs_info = shrinker->private_data;
+       const s64 total = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+
+       /* The unsigned long type is 32 bits on 32 bits platforms. */
+#if BITS_PER_LONG == 32
+       if (total > ULONG_MAX)
+               return ULONG_MAX;
+#endif
+       return total;
+}
+
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+       int ret;
+
+       ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+       if (ret)
+               return ret;
+
+       fs_info->extent_map_shrinker = shrinker_alloc(0, "em-btrfs:%s", fs_info->sb->s_id);
+       if (!fs_info->extent_map_shrinker) {
+               percpu_counter_destroy(&fs_info->evictable_extent_maps);
+               return -ENOMEM;
+       }
+
+       fs_info->extent_map_shrinker->scan_objects = btrfs_extent_maps_scan;
+       fs_info->extent_map_shrinker->count_objects = btrfs_extent_maps_count;
+       fs_info->extent_map_shrinker->private_data = fs_info;
+
+       shrinker_register(fs_info->extent_map_shrinker);
+
+       return 0;
+}
+
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+       shrinker_free(fs_info->extent_map_shrinker);
+       ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+       percpu_counter_destroy(&fs_info->evictable_extent_maps);
+}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index c3707461ff62..8a6be2f7a0e2 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -140,5 +140,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
                                    struct extent_map *new_em,
                                    bool modified);
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info);
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info);

 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 534d30dafe32..f1414814bd69 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -857,6 +857,8 @@ struct btrfs_fs_info {
        struct lockdep_map btrfs_trans_pending_ordered_map;
        struct lockdep_map btrfs_ordered_extent_map;

+       struct shrinker *extent_map_shrinker;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
        spinlock_t ref_verify_lock;
        struct rb_root block_tree;
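A usage note, not part of the patch: since this is an ordinary registered shrinker, it can be exercised without real memory pressure, because writing 2 to /proc/sys/vm/drop_caches makes the kernel invoke every registered shrinker via drop_slab(). Assuming the fio job from the commit message was run first:

   # Ask the kernel to reclaim slab objects; this calls all registered
   # shrinkers, including the new "em-btrfs:<dev>" one added by this patch.
   echo 2 > /proc/sys/vm/drop_caches

   # The extent map counts should drop (pinned extent maps are skipped).
   cat /sys/kernel/slab/btrfs_extent_map/objects \
       /sys/kernel/slab/btrfs_extent_map/total_objects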