diff mbox series

[v2,13/15] btrfs: add a shrinker for extent maps

Message ID 1cb649870b6cad4411da7998735ab1141bb9f2f0.1712837044.git.fdmanana@suse.com (mailing list archive)
State New
Headers show
Series btrfs: add a shrinker for extent maps | expand

Commit Message

Filipe Manana April 11, 2024, 4:19 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.

We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this not too noticeable
because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address
   space operation callback.

   However we won't release extent maps in the folio range if the folio is
   either dirty or under writeback or if the inode's i_size is less than
   or equals to 16M (see try_release_extent_mapping(). This 16M i_size
   constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
   extent_io and extent_state optimizations"), but there's no explanation
   about why we have it or why the 16M value.

This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size
   not larger than 16M, specially if they keep the files open for very
   long periods, therefore preventing inode eviction.

   This requires a really high number of such files, and having many non
   mergeable extent maps (due to random 4K writes for example) and a
   machine with very little memory;

2) There's a set tasks constantly doing random write IO (therefore
   creating many non mergeable extent maps) on files and keeping them
   open for long periods of time, so inode eviction doesn't happen and
   there's always a lot of dirty pages or pages under writeback,
   preventing btrfs_release_folio() from releasing the respective extent
   maps.

This second case was actually reported in the thread pointed by the Link
tag below, and it requires a very large file under heavy IO and a machine
with very little amount of RAM, which is probably hard to happen in
practice in a real world use case.

However when using direct IO this is not so hard to happen, because the
page cache is not used, and therefore btrfs_release_folio() is never
called. Which means extent maps are dropped only when evicting the inode,
and that means that if we have tasks that keep a file descriptor open and
keep doing IO on a very large file (or files), we can exhaust memory due
to an unbounded amount of extent maps. This is especially easy to happen
if we have a huge file with millions of small extents and their extent
maps are not mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS=""

   cat <<EOF > /tmp/fio-job.ini
   [global]
   name=fio-rand-write
   filename=$MNT/fio-rand-write
   rw=randwrite
   bs=4K
   direct=1
   numjobs=16
   fallocate=none
   time_based
   runtime=90000

   [file1]
   size=300G
   ioengine=libaio
   iodepth=16

   EOF

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   fio /tmp/fio-job.ini
   umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

   $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                        /sys/kernel/slab/btrfs_extent_map/total_objects'

Shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a short amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.

So to avoid this issue add a shrinker that will remove extents maps, as
long as they are not pinned, and takes proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is similar to the one ext4 uses
for its extent_status structure, which is analogous to btrfs' extent_map
structure.

Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c    |   7 +-
 fs/btrfs/extent_map.c | 156 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/fs.h         |   2 +
 4 files changed, 163 insertions(+), 4 deletions(-)

Comments

Josef Bacik April 12, 2024, 8:06 p.m. UTC | #1
On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Extent maps are used either to represent existing file extent items, or to
> represent new extents that are going to be written and the respective file
> extent items are created when the ordered extent completes.
> 
> We currently don't have any limit for how many extent maps we can have,
> neither per inode nor globally. Most of the time this not too noticeable
> because extent maps are removed in the following situations:
> 
> 1) When evicting an inode;
> 
> 2) When releasing folios (pages) through the btrfs_release_folio() address
>    space operation callback.
> 
>    However we won't release extent maps in the folio range if the folio is
>    either dirty or under writeback or if the inode's i_size is less than
>    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
>    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
>    extent_io and extent_state optimizations"), but there's no explanation
>    about why we have it or why the 16M value.
> 
> This means that for buffered IO we can reach an OOM situation due to too
> many extent maps if either of the following happens:
> 
> 1) There's a set of tasks constantly doing IO on many files with a size
>    not larger than 16M, specially if they keep the files open for very
>    long periods, therefore preventing inode eviction.
> 
>    This requires a really high number of such files, and having many non
>    mergeable extent maps (due to random 4K writes for example) and a
>    machine with very little memory;
> 
> 2) There's a set tasks constantly doing random write IO (therefore
>    creating many non mergeable extent maps) on files and keeping them
>    open for long periods of time, so inode eviction doesn't happen and
>    there's always a lot of dirty pages or pages under writeback,
>    preventing btrfs_release_folio() from releasing the respective extent
>    maps.
> 
> This second case was actually reported in the thread pointed by the Link
> tag below, and it requires a very large file under heavy IO and a machine
> with very little amount of RAM, which is probably hard to happen in
> practice in a real world use case.
> 
> However when using direct IO this is not so hard to happen, because the
> page cache is not used, and therefore btrfs_release_folio() is never
> called. Which means extent maps are dropped only when evicting the inode,
> and that means that if we have tasks that keep a file descriptor open and
> keep doing IO on a very large file (or files), we can exhaust memory due
> to an unbounded amount of extent maps. This is especially easy to happen
> if we have a huge file with millions of small extents and their extent
> maps are not mergeable (non contiguous offsets and disk locations).
> This was reported in that thread with the following fio test:
> 
>    $ cat test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdj
>    MNT=/mnt/sdj
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS=""
> 
>    cat <<EOF > /tmp/fio-job.ini
>    [global]
>    name=fio-rand-write
>    filename=$MNT/fio-rand-write
>    rw=randwrite
>    bs=4K
>    direct=1
>    numjobs=16
>    fallocate=none
>    time_based
>    runtime=90000
> 
>    [file1]
>    size=300G
>    ioengine=libaio
>    iodepth=16
> 
>    EOF
> 
>    umount $MNT &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    fio /tmp/fio-job.ini
>    umount $MNT
> 
> Monitoring the btrfs_extent_map slab while running the test with:
> 
>    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
>                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> 
> Shows the number of active and total extent maps skyrocketing to tens of
> millions, and on systems with a short amount of memory it's easy and quick
> to get into an OOM situation, as reported in that thread.
> 
> So to avoid this issue add a shrinker that will remove extents maps, as
> long as they are not pinned, and takes proper care with any concurrent
> fsync to avoid missing extents (setting the full sync flag while in the
> middle of a fast fsync). This shrinker is similar to the one ext4 uses
> for its extent_status structure, which is analogous to btrfs' extent_map
> structure.
> 
> Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I don't like this for a few reasons

1. We're always starting with the first root and the first inode.  We're just
   going to constantly screw that first inode over and over again.
2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
   more users.  It would be nice if we could just utilize ->s_inodes_lru instead
   for this, which would also give us the nice advantage of not having to think
   about order since it's already in LRU order.
3. We're registering our own shrinker without a proper LRU setup.  I think it
   would make sense if we wanted to have a LRU for our extent maps, but I think
   that's not a great idea.  We could get the same benefit by adding our own
   ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
   approach no matter what other changes you make instead of registering our own
   shrinker.

The concept I whole heartedly agree with, this just needs some tweaks to be more
fair and cleaner.  The rest of the code is fine, you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to the rest of it.  Thanks,

Josef
Filipe Manana April 13, 2024, 11:07 a.m. UTC | #2
On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Extent maps are used either to represent existing file extent items, or to
> > represent new extents that are going to be written and the respective file
> > extent items are created when the ordered extent completes.
> >
> > We currently don't have any limit for how many extent maps we can have,
> > neither per inode nor globally. Most of the time this not too noticeable
> > because extent maps are removed in the following situations:
> >
> > 1) When evicting an inode;
> >
> > 2) When releasing folios (pages) through the btrfs_release_folio() address
> >    space operation callback.
> >
> >    However we won't release extent maps in the folio range if the folio is
> >    either dirty or under writeback or if the inode's i_size is less than
> >    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> >    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> >    extent_io and extent_state optimizations"), but there's no explanation
> >    about why we have it or why the 16M value.
> >
> > This means that for buffered IO we can reach an OOM situation due to too
> > many extent maps if either of the following happens:
> >
> > 1) There's a set of tasks constantly doing IO on many files with a size
> >    not larger than 16M, specially if they keep the files open for very
> >    long periods, therefore preventing inode eviction.
> >
> >    This requires a really high number of such files, and having many non
> >    mergeable extent maps (due to random 4K writes for example) and a
> >    machine with very little memory;
> >
> > 2) There's a set tasks constantly doing random write IO (therefore
> >    creating many non mergeable extent maps) on files and keeping them
> >    open for long periods of time, so inode eviction doesn't happen and
> >    there's always a lot of dirty pages or pages under writeback,
> >    preventing btrfs_release_folio() from releasing the respective extent
> >    maps.
> >
> > This second case was actually reported in the thread pointed by the Link
> > tag below, and it requires a very large file under heavy IO and a machine
> > with very little amount of RAM, which is probably hard to happen in
> > practice in a real world use case.
> >
> > However when using direct IO this is not so hard to happen, because the
> > page cache is not used, and therefore btrfs_release_folio() is never
> > called. Which means extent maps are dropped only when evicting the inode,
> > and that means that if we have tasks that keep a file descriptor open and
> > keep doing IO on a very large file (or files), we can exhaust memory due
> > to an unbounded amount of extent maps. This is especially easy to happen
> > if we have a huge file with millions of small extents and their extent
> > maps are not mergeable (non contiguous offsets and disk locations).
> > This was reported in that thread with the following fio test:
> >
> >    $ cat test.sh
> >    #!/bin/bash
> >
> >    DEV=/dev/sdj
> >    MNT=/mnt/sdj
> >    MOUNT_OPTIONS="-o ssd"
> >    MKFS_OPTIONS=""
> >
> >    cat <<EOF > /tmp/fio-job.ini
> >    [global]
> >    name=fio-rand-write
> >    filename=$MNT/fio-rand-write
> >    rw=randwrite
> >    bs=4K
> >    direct=1
> >    numjobs=16
> >    fallocate=none
> >    time_based
> >    runtime=90000
> >
> >    [file1]
> >    size=300G
> >    ioengine=libaio
> >    iodepth=16
> >
> >    EOF
> >
> >    umount $MNT &> /dev/null
> >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> >    mount $MOUNT_OPTIONS $DEV $MNT
> >
> >    fio /tmp/fio-job.ini
> >    umount $MNT
> >
> > Monitoring the btrfs_extent_map slab while running the test with:
> >
> >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> >
> > Shows the number of active and total extent maps skyrocketing to tens of
> > millions, and on systems with a short amount of memory it's easy and quick
> > to get into an OOM situation, as reported in that thread.
> >
> > So to avoid this issue add a shrinker that will remove extents maps, as
> > long as they are not pinned, and takes proper care with any concurrent
> > fsync to avoid missing extents (setting the full sync flag while in the
> > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > for its extent_status structure, which is analogous to btrfs' extent_map
> > structure.
> >
> > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> I don't like this for a few reasons
>
> 1. We're always starting with the first root and the first inode.  We're just
>    going to constantly screw that first inode over and over again.
> 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
>    more users.  It would be nice if we could just utilize ->s_inodes_lru instead
>    for this, which would also give us the nice advantage of not having to think
>    about order since it's already in LRU order.
> 3. We're registering our own shrinker without a proper LRU setup.  I think it
>    would make sense if we wanted to have a LRU for our extent maps, but I think
>    that's not a great idea.  We could get the same benefit by adding our own
>    ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
>    approach no matter what other changes you make instead of registering our own
>    shrinker.

Ok, so some comments about all that.

Using sb->s_inode_lru, which has the struct list_lru type is complicated.
I had considered that some time ago, and here are the problems with it:

1) List iteration, done with list_lru_walk() (or one of its variants),
is done while holding the lru list's spinlock.

This means we can't remove all extents maps in one go as that will
take too much time if we have millions of them for example, and make
other tasks spin for too long on the lock.

We will also need to take the inode's mmap semaphore which is a
blocking operation, so we can't do it while under the spinlock.

Sure, the lru list iteration's callback (the "isolate" argument of
type list_lru_walk_cb) can release that lru list spinlock, do the
extent map iteration and removal, then lock it again and return
LRU_RETRY.
But that will restart the search from the first element in the lru
list. This means we can be iterating over the same inodes over and
over.

2) You may hate the inode rb-tree, but we don't have another way to
iterate over the inodes and be efficient to search for inodes starting
at a particular number.

3)  ->nr_cached_objects() and ->free_cached_objects() of the super
operations are a good alternative to registering our own shrinker.
I went with our own shrinker because it's what ext4 is doing and it's
simple. Changing to the super operations is fairly simple and I
embrace it.

4) That concern about LRU is not really that relevant as you think.

Look at fs/super.c:super_cache_scan() (for when we define our
->nr_cached_objects() and ->free_cached_objects() callbacks).

Every time ->free_cached_objects() is called we do it for a number of
items we returned with ->nr_cached_objects().
That means we all go over all inodes and so going in LRU order or any
other order ends up being irrelevant.

The same goes for the shrinker solution.

Sure you can argue that in the time between calling
->nr_cached_objects() and calling ->free_cached_objects() more extent
maps may have been created.
But that's not a big time window, not enough to add that many extent
maps to make any practical difference.
Plus when the shrinking is performed it's because we are under heavy
memory pressure and removing extent maps won't have that much impact
anyway, since page cache was freed amongst other things that have more
impact on performance than extent maps.

This is probably why ext4 is following the same approach for its
extent_status structures as I did here (well, I followed their way of
doing things).
Those structures are in a rb tree of the inode, just like we do with
our extent maps.
And the inodes are in a regular linked list (struct
ext4_sb_info::s_es_list) and added with list_add_tail to that list -
they're never moved within the list (no LRU).
When the scan callback of the shrinker is invoked, it always iterates
the inodes in that list order - not LRU or anything "fancy".


So I'm okay with moving to an implementation based on
->nr_cached_objects() and calling ->free_cached_objects(), that's
simple.
But iterating the inodes will have to be with the rb-tree, as we don't
have anything else that allows us to iterate and start at any given
number.
I can make it to remember the last scanned inode (and root) so that
the next time the shrinking happens it will start after that last
scanned inode.

And adding our own lru implementation wouldn't make much difference
because of all that, and would also have 2 downsides to it:

1) We would need a struct list_head  (16 bytes) plus a pointer to the
inode (so that we can remove an extent map from the rb tree) added to
the extent_map structure, so 24 bytes overhead.

2) All the maintenance of the list, adding to it, removing from it,
moving elements, etc, all requires locking and overhead from lock
contention (using the struct list_lru API or our own).

So my proposal is to do this:

1) Keep using the rb tree to search for inodes - this is abstracted by
an helper function used in 2 other places.
2) Remember the last scanned inode/root, and on every scan start after
that inode.
3) Use ->nr_cached_objects() and calling ->free_cached_objects()
instead of registering a shrinker.

What do you think?
Thanks.

>
> The concept I whole heartedly agree with, this just needs some tweaks to be more
> fair and cleaner.  The rest of the code is fine, you can add
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> to the rest of it.  Thanks,
>
> Josef
Filipe Manana April 14, 2024, 10:38 a.m. UTC | #3
On Sat, Apr 13, 2024 at 12:07 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Extent maps are used either to represent existing file extent items, or to
> > > represent new extents that are going to be written and the respective file
> > > extent items are created when the ordered extent completes.
> > >
> > > We currently don't have any limit for how many extent maps we can have,
> > > neither per inode nor globally. Most of the time this not too noticeable
> > > because extent maps are removed in the following situations:
> > >
> > > 1) When evicting an inode;
> > >
> > > 2) When releasing folios (pages) through the btrfs_release_folio() address
> > >    space operation callback.
> > >
> > >    However we won't release extent maps in the folio range if the folio is
> > >    either dirty or under writeback or if the inode's i_size is less than
> > >    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> > >    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> > >    extent_io and extent_state optimizations"), but there's no explanation
> > >    about why we have it or why the 16M value.
> > >
> > > This means that for buffered IO we can reach an OOM situation due to too
> > > many extent maps if either of the following happens:
> > >
> > > 1) There's a set of tasks constantly doing IO on many files with a size
> > >    not larger than 16M, specially if they keep the files open for very
> > >    long periods, therefore preventing inode eviction.
> > >
> > >    This requires a really high number of such files, and having many non
> > >    mergeable extent maps (due to random 4K writes for example) and a
> > >    machine with very little memory;
> > >
> > > 2) There's a set tasks constantly doing random write IO (therefore
> > >    creating many non mergeable extent maps) on files and keeping them
> > >    open for long periods of time, so inode eviction doesn't happen and
> > >    there's always a lot of dirty pages or pages under writeback,
> > >    preventing btrfs_release_folio() from releasing the respective extent
> > >    maps.
> > >
> > > This second case was actually reported in the thread pointed by the Link
> > > tag below, and it requires a very large file under heavy IO and a machine
> > > with very little amount of RAM, which is probably hard to happen in
> > > practice in a real world use case.
> > >
> > > However when using direct IO this is not so hard to happen, because the
> > > page cache is not used, and therefore btrfs_release_folio() is never
> > > called. Which means extent maps are dropped only when evicting the inode,
> > > and that means that if we have tasks that keep a file descriptor open and
> > > keep doing IO on a very large file (or files), we can exhaust memory due
> > > to an unbounded amount of extent maps. This is especially easy to happen
> > > if we have a huge file with millions of small extents and their extent
> > > maps are not mergeable (non contiguous offsets and disk locations).
> > > This was reported in that thread with the following fio test:
> > >
> > >    $ cat test.sh
> > >    #!/bin/bash
> > >
> > >    DEV=/dev/sdj
> > >    MNT=/mnt/sdj
> > >    MOUNT_OPTIONS="-o ssd"
> > >    MKFS_OPTIONS=""
> > >
> > >    cat <<EOF > /tmp/fio-job.ini
> > >    [global]
> > >    name=fio-rand-write
> > >    filename=$MNT/fio-rand-write
> > >    rw=randwrite
> > >    bs=4K
> > >    direct=1
> > >    numjobs=16
> > >    fallocate=none
> > >    time_based
> > >    runtime=90000
> > >
> > >    [file1]
> > >    size=300G
> > >    ioengine=libaio
> > >    iodepth=16
> > >
> > >    EOF
> > >
> > >    umount $MNT &> /dev/null
> > >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> > >    mount $MOUNT_OPTIONS $DEV $MNT
> > >
> > >    fio /tmp/fio-job.ini
> > >    umount $MNT
> > >
> > > Monitoring the btrfs_extent_map slab while running the test with:
> > >
> > >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> > >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> > >
> > > Shows the number of active and total extent maps skyrocketing to tens of
> > > millions, and on systems with a short amount of memory it's easy and quick
> > > to get into an OOM situation, as reported in that thread.
> > >
> > > So to avoid this issue add a shrinker that will remove extents maps, as
> > > long as they are not pinned, and takes proper care with any concurrent
> > > fsync to avoid missing extents (setting the full sync flag while in the
> > > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > > for its extent_status structure, which is analogous to btrfs' extent_map
> > > structure.
> > >
> > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > I don't like this for a few reasons
> >
> > 1. We're always starting with the first root and the first inode.  We're just
> >    going to constantly screw that first inode over and over again.
> > 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
> >    more users.  It would be nice if we could just utilize ->s_inodes_lru instead
> >    for this, which would also give us the nice advantage of not having to think
> >    about order since it's already in LRU order.
> > 3. We're registering our own shrinker without a proper LRU setup.  I think it
> >    would make sense if we wanted to have a LRU for our extent maps, but I think
> >    that's not a great idea.  We could get the same benefit by adding our own
> >    ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
> >    approach no matter what other changes you make instead of registering our own
> >    shrinker.
>
> Ok, so some comments about all that.
>
> Using sb->s_inode_lru, which has the struct list_lru type is complicated.
> I had considered that some time ago, and here are the problems with it:
>
> 1) List iteration, done with list_lru_walk() (or one of its variants),
> is done while holding the lru list's spinlock.
>
> This means we can't remove all extents maps in one go as that will
> take too much time if we have millions of them for example, and make
> other tasks spin for too long on the lock.
>
> We will also need to take the inode's mmap semaphore which is a
> blocking operation, so we can't do it while under the spinlock.
>
> Sure, the lru list iteration's callback (the "isolate" argument of
> type list_lru_walk_cb) can release that lru list spinlock, do the
> extent map iteration and removal, then lock it again and return
> LRU_RETRY.
> But that will restart the search from the first element in the lru
> list. This means we can be iterating over the same inodes over and
> over.
>
> 2) You may hate the inode rb-tree, but we don't have another way to
> iterate over the inodes and be efficient to search for inodes starting
> at a particular number.
>
> 3)  ->nr_cached_objects() and ->free_cached_objects() of the super
> operations are a good alternative to registering our own shrinker.
> I went with our own shrinker because it's what ext4 is doing and it's
> simple. Changing to the super operations is fairly simple and I
> embrace it.
>
> 4) That concern about LRU is not really that relevant as you think.
>
> Look at fs/super.c:super_cache_scan() (for when we define our
> ->nr_cached_objects() and ->free_cached_objects() callbacks).
>
> Every time ->free_cached_objects() is called we do it for a number of
> items we returned with ->nr_cached_objects().
> That means we all go over all inodes and so going in LRU order or any
> other order ends up being irrelevant.
>
> The same goes for the shrinker solution.
>
> Sure you can argue that in the time between calling
> ->nr_cached_objects() and calling ->free_cached_objects() more extent
> maps may have been created.
> But that's not a big time window, not enough to add that many extent
> maps to make any practical difference.
> Plus when the shrinking is performed it's because we are under heavy
> memory pressure and removing extent maps won't have that much impact
> anyway, since page cache was freed amongst other things that have more
> impact on performance than extent maps.
>
> This is probably why ext4 is following the same approach for its
> extent_status structures as I did here (well, I followed their way of
> doing things).
> Those structures are in a rb tree of the inode, just like we do with
> our extent maps.
> And the inodes are in a regular linked list (struct
> ext4_sb_info::s_es_list) and added with list_add_tail to that list -
> they're never moved within the list (no LRU).
> When the scan callback of the shrinker is invoked, it always iterates
> the inodes in that list order - not LRU or anything "fancy".

Another example is xfs, which implements a ->nr_cached_objects() and
->free_cached_objects().
From xfs_fs_free_cached_objects() we end up at xfs_icwalk_ag, where it
walks inodes in the order they are in a radix tree, not in LRU order.
It does however keep track of the last index processed so that the
next time it is called, it starts from that index, but definitely not
LRU.

>
>
> So I'm okay with moving to an implementation based on
> ->nr_cached_objects() and calling ->free_cached_objects(), that's
> simple.
> But iterating the inodes will have to be with the rb-tree, as we don't
> have anything else that allows us to iterate and start at any given
> number.
> I can make it to remember the last scanned inode (and root) so that
> the next time the shrinking happens it will start after that last
> scanned inode.
>
> And adding our own lru implementation wouldn't make much difference
> because of all that, and would also have 2 downsides to it:
>
> 1) We would need a struct list_head  (16 bytes) plus a pointer to the
> inode (so that we can remove an extent map from the rb tree) added to
> the extent_map structure, so 24 bytes overhead.
>
> 2) All the maintenance of the list, adding to it, removing from it,
> moving elements, etc, all requires locking and overhead from lock
> contention (using the struct list_lru API or our own).
>
> So my proposal is to do this:
>
> 1) Keep using the rb tree to search for inodes - this is abstracted by
> an helper function used in 2 other places.
> 2) Remember the last scanned inode/root, and on every scan start after
> that inode.
> 3) Use ->nr_cached_objects() and calling ->free_cached_objects()
> instead of registering a shrinker.
>
> What do you think?
> Thanks.
>
> >
> > The concept I whole heartedly agree with, this just needs some tweaks to be more
> > fair and cleaner.  The rest of the code is fine, you can add
> >
> > Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> >
> > to the rest of it.  Thanks,
> >
> > Josef
Josef Bacik April 14, 2024, 1:02 p.m. UTC | #4
On Sat, Apr 13, 2024 at 12:07:30PM +0100, Filipe Manana wrote:
> On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Extent maps are used either to represent existing file extent items, or to
> > > represent new extents that are going to be written and the respective file
> > > extent items are created when the ordered extent completes.
> > >
> > > We currently don't have any limit for how many extent maps we can have,
> > > neither per inode nor globally. Most of the time this not too noticeable
> > > because extent maps are removed in the following situations:
> > >
> > > 1) When evicting an inode;
> > >
> > > 2) When releasing folios (pages) through the btrfs_release_folio() address
> > >    space operation callback.
> > >
> > >    However we won't release extent maps in the folio range if the folio is
> > >    either dirty or under writeback or if the inode's i_size is less than
> > >    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> > >    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> > >    extent_io and extent_state optimizations"), but there's no explanation
> > >    about why we have it or why the 16M value.
> > >
> > > This means that for buffered IO we can reach an OOM situation due to too
> > > many extent maps if either of the following happens:
> > >
> > > 1) There's a set of tasks constantly doing IO on many files with a size
> > >    not larger than 16M, specially if they keep the files open for very
> > >    long periods, therefore preventing inode eviction.
> > >
> > >    This requires a really high number of such files, and having many non
> > >    mergeable extent maps (due to random 4K writes for example) and a
> > >    machine with very little memory;
> > >
> > > 2) There's a set tasks constantly doing random write IO (therefore
> > >    creating many non mergeable extent maps) on files and keeping them
> > >    open for long periods of time, so inode eviction doesn't happen and
> > >    there's always a lot of dirty pages or pages under writeback,
> > >    preventing btrfs_release_folio() from releasing the respective extent
> > >    maps.
> > >
> > > This second case was actually reported in the thread pointed by the Link
> > > tag below, and it requires a very large file under heavy IO and a machine
> > > with very little amount of RAM, which is probably hard to happen in
> > > practice in a real world use case.
> > >
> > > However when using direct IO this is not so hard to happen, because the
> > > page cache is not used, and therefore btrfs_release_folio() is never
> > > called. Which means extent maps are dropped only when evicting the inode,
> > > and that means that if we have tasks that keep a file descriptor open and
> > > keep doing IO on a very large file (or files), we can exhaust memory due
> > > to an unbounded amount of extent maps. This is especially easy to happen
> > > if we have a huge file with millions of small extents and their extent
> > > maps are not mergeable (non contiguous offsets and disk locations).
> > > This was reported in that thread with the following fio test:
> > >
> > >    $ cat test.sh
> > >    #!/bin/bash
> > >
> > >    DEV=/dev/sdj
> > >    MNT=/mnt/sdj
> > >    MOUNT_OPTIONS="-o ssd"
> > >    MKFS_OPTIONS=""
> > >
> > >    cat <<EOF > /tmp/fio-job.ini
> > >    [global]
> > >    name=fio-rand-write
> > >    filename=$MNT/fio-rand-write
> > >    rw=randwrite
> > >    bs=4K
> > >    direct=1
> > >    numjobs=16
> > >    fallocate=none
> > >    time_based
> > >    runtime=90000
> > >
> > >    [file1]
> > >    size=300G
> > >    ioengine=libaio
> > >    iodepth=16
> > >
> > >    EOF
> > >
> > >    umount $MNT &> /dev/null
> > >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> > >    mount $MOUNT_OPTIONS $DEV $MNT
> > >
> > >    fio /tmp/fio-job.ini
> > >    umount $MNT
> > >
> > > Monitoring the btrfs_extent_map slab while running the test with:
> > >
> > >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> > >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> > >
> > > Shows the number of active and total extent maps skyrocketing to tens of
> > > millions, and on systems with a short amount of memory it's easy and quick
> > > to get into an OOM situation, as reported in that thread.
> > >
> > > So to avoid this issue add a shrinker that will remove extents maps, as
> > > long as they are not pinned, and takes proper care with any concurrent
> > > fsync to avoid missing extents (setting the full sync flag while in the
> > > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > > for its extent_status structure, which is analogous to btrfs' extent_map
> > > structure.
> > >
> > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > I don't like this for a few reasons
> >
> > 1. We're always starting with the first root and the first inode.  We're just
> >    going to constantly screw that first inode over and over again.
> > 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
> >    more users.  It would be nice if we could just utilize ->s_inodes_lru instead
> >    for this, which would also give us the nice advantage of not having to think
> >    about order since it's already in LRU order.
> > 3. We're registering our own shrinker without a proper LRU setup.  I think it
> >    would make sense if we wanted to have a LRU for our extent maps, but I think
> >    that's not a great idea.  We could get the same benefit by adding our own
> >    ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
> >    approach no matter what other changes you make instead of registering our own
> >    shrinker.
> 
> Ok, so some comments about all that.
> 
> Using sb->s_inode_lru, which has the struct list_lru type is complicated.
> I had considered that some time ago, and here are the problems with it:
> 
> 1) List iteration, done with list_lru_walk() (or one of its variants),
> is done while holding the lru list's spinlock.
> 
> This means we can't remove all extents maps in one go as that will
> take too much time if we have millions of them for example, and make
> other tasks spin for too long on the lock.
> 
> We will also need to take the inode's mmap semaphore which is a
> blocking operation, so we can't do it while under the spinlock.
> 
> Sure, the lru list iteration's callback (the "isolate" argument of
> type list_lru_walk_cb) can release that lru list spinlock, do the
> extent map iteration and removal, then lock it again and return
> LRU_RETRY.
> But that will restart the search from the first element in the lru
> list. This means we can be iterating over the same inodes over and
> over.

Agreed, but at least it'll be in LRU order, we'll be visiting each inode in
their LRU order.  If the first inode hasn't changed position then it hasn't been
accessed in a while.

We absolutely would need to use the isolate mechanism, I think we would just add
a counter to the extent_io_tree to see how many extent states there are, and we
skip anything that's already been drained.  Then we can do our normal locking
thing when we walk through the isolate list.

> 
> 2) You may hate the inode rb-tree, but we don't have another way to
> iterate over the inodes and be efficient to search for inodes starting
> at a particular number.
> 
> 3)  ->nr_cached_objects() and ->free_cached_objects() of the super
> operations are a good alternative to registering our own shrinker.
> I went with our own shrinker because it's what ext4 is doing and it's
> simple. Changing to the super operations is fairly simple and I
> embrace it.
> 
> 4) That concern about LRU is not really that relevant as you think.
> 
> Look at fs/super.c:super_cache_scan() (for when we define our
> ->nr_cached_objects() and ->free_cached_objects() callbacks).
> 
> Every time ->free_cached_objects() is called we do it for a number of
> items we returned with ->nr_cached_objects().
> That means we all go over all inodes and so going in LRU order or any
> other order ends up being irrelevant.
> 
> The same goes for the shrinker solution.
> 
> Sure you can argue that in the time between calling
> ->nr_cached_objects() and calling ->free_cached_objects() more extent
> maps may have been created.
> But that's not a big time window, not enough to add that many extent
> maps to make any practical difference.
> Plus when the shrinking is performed it's because we are under heavy
> memory pressure and removing extent maps won't have that much impact
> anyway, since page cache was freed amongst other things that have more
> impact on performance than extent maps.
> 
> This is probably why ext4 is following the same approach for its
> extent_status structures as I did here (well, I followed their way of
> doing things).
> Those structures are in a rb tree of the inode, just like we do with
> our extent maps.
> And the inodes are in a regular linked list (struct
> ext4_sb_info::s_es_list) and added with list_add_tail to that list -
> they're never moved within the list (no LRU).
> When the scan callback of the shrinker is invoked, it always iterates
> the inodes in that list order - not LRU or anything "fancy".
> 
> 
> So I'm okay with moving to an implementation based on
> ->nr_cached_objects() and calling ->free_cached_objects(), that's
> simple.
> But iterating the inodes will have to be with the rb-tree, as we don't
> have anything else that allows us to iterate and start at any given
> number.
> I can make it to remember the last scanned inode (and root) so that
> the next time the shrinking happens it will start after that last
> scanned inode.
> 
> And adding our own lru implementation wouldn't make much difference
> because of all that, and would also have 2 downsides to it:
> 
> 1) We would need a struct list_head  (16 bytes) plus a pointer to the
> inode (so that we can remove an extent map from the rb tree) added to
> the extent_map structure, so 24 bytes overhead.
> 
> 2) All the maintenance of the list, adding to it, removing from it,
> moving elements, etc, all requires locking and overhead from lock
> contention (using the struct list_lru API or our own).
> 
> So my proposal is to do this:
> 
> 1) Keep using the rb tree to search for inodes - this is abstracted by
> an helper function used in 2 other places.
> 2) Remember the last scanned inode/root, and on every scan start after
> that inode.
> 3) Use ->nr_cached_objects() and calling ->free_cached_objects()
> instead of registering a shrinker.
>

I'm still looking for a way to delete our rb_tree in the future, it causes us
headaches with a lot of parallel file creates, but honestly other things are
much worse than the rbtree right now.

This is fine with me, we're not getting rid of it anytime soon, as long as we
have a way to remember where we were then that's good enough for now and in
keeping with all other existing reclaim implementations.  Thanks,

Josef
Filipe Manana April 15, 2024, 11:24 a.m. UTC | #5
On Sun, Apr 14, 2024 at 2:02 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Sat, Apr 13, 2024 at 12:07:30PM +0100, Filipe Manana wrote:
> > On Fri, Apr 12, 2024 at 9:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 05:19:07PM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Extent maps are used either to represent existing file extent items, or to
> > > > represent new extents that are going to be written and the respective file
> > > > extent items are created when the ordered extent completes.
> > > >
> > > > We currently don't have any limit for how many extent maps we can have,
> > > > neither per inode nor globally. Most of the time this not too noticeable
> > > > because extent maps are removed in the following situations:
> > > >
> > > > 1) When evicting an inode;
> > > >
> > > > 2) When releasing folios (pages) through the btrfs_release_folio() address
> > > >    space operation callback.
> > > >
> > > >    However we won't release extent maps in the folio range if the folio is
> > > >    either dirty or under writeback or if the inode's i_size is less than
> > > >    or equals to 16M (see try_release_extent_mapping(). This 16M i_size
> > > >    constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
> > > >    extent_io and extent_state optimizations"), but there's no explanation
> > > >    about why we have it or why the 16M value.
> > > >
> > > > This means that for buffered IO we can reach an OOM situation due to too
> > > > many extent maps if either of the following happens:
> > > >
> > > > 1) There's a set of tasks constantly doing IO on many files with a size
> > > >    not larger than 16M, specially if they keep the files open for very
> > > >    long periods, therefore preventing inode eviction.
> > > >
> > > >    This requires a really high number of such files, and having many non
> > > >    mergeable extent maps (due to random 4K writes for example) and a
> > > >    machine with very little memory;
> > > >
> > > > 2) There's a set tasks constantly doing random write IO (therefore
> > > >    creating many non mergeable extent maps) on files and keeping them
> > > >    open for long periods of time, so inode eviction doesn't happen and
> > > >    there's always a lot of dirty pages or pages under writeback,
> > > >    preventing btrfs_release_folio() from releasing the respective extent
> > > >    maps.
> > > >
> > > > This second case was actually reported in the thread pointed by the Link
> > > > tag below, and it requires a very large file under heavy IO and a machine
> > > > with very little amount of RAM, which is probably hard to happen in
> > > > practice in a real world use case.
> > > >
> > > > However when using direct IO this is not so hard to happen, because the
> > > > page cache is not used, and therefore btrfs_release_folio() is never
> > > > called. Which means extent maps are dropped only when evicting the inode,
> > > > and that means that if we have tasks that keep a file descriptor open and
> > > > keep doing IO on a very large file (or files), we can exhaust memory due
> > > > to an unbounded amount of extent maps. This is especially easy to happen
> > > > if we have a huge file with millions of small extents and their extent
> > > > maps are not mergeable (non contiguous offsets and disk locations).
> > > > This was reported in that thread with the following fio test:
> > > >
> > > >    $ cat test.sh
> > > >    #!/bin/bash
> > > >
> > > >    DEV=/dev/sdj
> > > >    MNT=/mnt/sdj
> > > >    MOUNT_OPTIONS="-o ssd"
> > > >    MKFS_OPTIONS=""
> > > >
> > > >    cat <<EOF > /tmp/fio-job.ini
> > > >    [global]
> > > >    name=fio-rand-write
> > > >    filename=$MNT/fio-rand-write
> > > >    rw=randwrite
> > > >    bs=4K
> > > >    direct=1
> > > >    numjobs=16
> > > >    fallocate=none
> > > >    time_based
> > > >    runtime=90000
> > > >
> > > >    [file1]
> > > >    size=300G
> > > >    ioengine=libaio
> > > >    iodepth=16
> > > >
> > > >    EOF
> > > >
> > > >    umount $MNT &> /dev/null
> > > >    mkfs.btrfs -f $MKFS_OPTIONS $DEV
> > > >    mount $MOUNT_OPTIONS $DEV $MNT
> > > >
> > > >    fio /tmp/fio-job.ini
> > > >    umount $MNT
> > > >
> > > > Monitoring the btrfs_extent_map slab while running the test with:
> > > >
> > > >    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
> > > >                         /sys/kernel/slab/btrfs_extent_map/total_objects'
> > > >
> > > > Shows the number of active and total extent maps skyrocketing to tens of
> > > > millions, and on systems with a short amount of memory it's easy and quick
> > > > to get into an OOM situation, as reported in that thread.
> > > >
> > > > So to avoid this issue add a shrinker that will remove extents maps, as
> > > > long as they are not pinned, and takes proper care with any concurrent
> > > > fsync to avoid missing extents (setting the full sync flag while in the
> > > > middle of a fast fsync). This shrinker is similar to the one ext4 uses
> > > > for its extent_status structure, which is analogous to btrfs' extent_map
> > > > structure.
> > > >
> > > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >
> > > I don't like this for a few reasons
> > >
> > > 1. We're always starting with the first root and the first inode.  We're just
> > >    going to constantly screw that first inode over and over again.
> > > 2. I really, really hate our inode rb-tree, I want to reduce it's use, not add
> > >    more users.  It would be nice if we could just utilize ->s_inodes_lru instead
> > >    for this, which would also give us the nice advantage of not having to think
> > >    about order since it's already in LRU order.
> > > 3. We're registering our own shrinker without a proper LRU setup.  I think it
> > >    would make sense if we wanted to have a LRU for our extent maps, but I think
> > >    that's not a great idea.  We could get the same benefit by adding our own
> > >    ->nr_cached_objects() and ->free_cached_objects(), I think that's a better
> > >    approach no matter what other changes you make instead of registering our own
> > >    shrinker.
> >
> > Ok, so some comments about all that.
> >
> > Using sb->s_inode_lru, which has the struct list_lru type is complicated.
> > I had considered that some time ago, and here are the problems with it:
> >
> > 1) List iteration, done with list_lru_walk() (or one of its variants),
> > is done while holding the lru list's spinlock.
> >
> > This means we can't remove all extents maps in one go as that will
> > take too much time if we have millions of them for example, and make
> > other tasks spin for too long on the lock.
> >
> > We will also need to take the inode's mmap semaphore which is a
> > blocking operation, so we can't do it while under the spinlock.
> >
> > Sure, the lru list iteration's callback (the "isolate" argument of
> > type list_lru_walk_cb) can release that lru list spinlock, do the
> > extent map iteration and removal, then lock it again and return
> > LRU_RETRY.
> > But that will restart the search from the first element in the lru
> > list. This means we can be iterating over the same inodes over and
> > over.
>
> Agreed, but at least it'll be in LRU order, we'll be visiting each inode in
> their LRU order.  If the first inode hasn't changed position then it hasn't been
> accessed in a while.
>
> We absolutely would need to use the isolate mechanism, I think we would just add
> a counter to the extent_io_tree to see how many extent states there are, and we
> skip anything that's already been drained.

Looking at the extent io tree is irrelevant for many scenarios, and
LRU is often a very poor choice.

For example consider an application that keeps a file descriptor open
for a very long period, like a database server (I've worked in the
past on one that did that).
If it's using direct IO and does some random reads here and there,
then caches the data in its memory (not the page cache).
The extent io tree will be empty and we'll have around the extent maps
for those ranges which the application will not use for a very long
period, maybe hours or more.

Despite that it is the most recently used inode, yet it's the one with
the most useless extents maps.

The same goes for writes, the application will cache the data in
memory and having the extent map around is useless for it.
Even if it overwrites those same ranges in the near future, since for
COW writes we create a new extent map and drop the old one, and for
NOCOW we always check file extent items in the subvolume tree, not
using any extent maps in the range. That is, the extent maps aren't
needed for the writes.

And same as before, the inode will be the most recently used.

What ext4 and xfs are doing is just fine.
There's no need to complicate something that when triggered is when
the system is already under pressure and everything is already slow
due to the memory pressure.

> Then we can do our normal locking
> thing when we walk through the isolate list.
>
> >
> > 2) You may hate the inode rb-tree, but we don't have another way to
> > iterate over the inodes and be efficient to search for inodes starting
> > at a particular number.
> >
> > 3)  ->nr_cached_objects() and ->free_cached_objects() of the super
> > operations are a good alternative to registering our own shrinker.
> > I went with our own shrinker because it's what ext4 is doing and it's
> > simple. Changing to the super operations is fairly simple and I
> > embrace it.
> >
> > 4) That concern about LRU is not really that relevant as you think.
> >
> > Look at fs/super.c:super_cache_scan() (for when we define our
> > ->nr_cached_objects() and ->free_cached_objects() callbacks).
> >
> > Every time ->free_cached_objects() is called we do it for a number of
> > items we returned with ->nr_cached_objects().
> > That means we all go over all inodes and so going in LRU order or any
> > other order ends up being irrelevant.
> >
> > The same goes for the shrinker solution.
> >
> > Sure you can argue that in the time between calling
> > ->nr_cached_objects() and calling ->free_cached_objects() more extent
> > maps may have been created.
> > But that's not a big time window, not enough to add that many extent
> > maps to make any practical difference.
> > Plus when the shrinking is performed it's because we are under heavy
> > memory pressure and removing extent maps won't have that much impact
> > anyway, since page cache was freed amongst other things that have more
> > impact on performance than extent maps.
> >
> > This is probably why ext4 is following the same approach for its
> > extent_status structures as I did here (well, I followed their way of
> > doing things).
> > Those structures are in a rb tree of the inode, just like we do with
> > our extent maps.
> > And the inodes are in a regular linked list (struct
> > ext4_sb_info::s_es_list) and added with list_add_tail to that list -
> > they're never moved within the list (no LRU).
> > When the scan callback of the shrinker is invoked, it always iterates
> > the inodes in that list order - not LRU or anything "fancy".
> >
> >
> > So I'm okay with moving to an implementation based on
> > ->nr_cached_objects() and calling ->free_cached_objects(), that's
> > simple.
> > But iterating the inodes will have to be with the rb-tree, as we don't
> > have anything else that allows us to iterate and start at any given
> > number.
> > I can make it to remember the last scanned inode (and root) so that
> > the next time the shrinking happens it will start after that last
> > scanned inode.
> >
> > And adding our own lru implementation wouldn't make much difference
> > because of all that, and would also have 2 downsides to it:
> >
> > 1) We would need a struct list_head  (16 bytes) plus a pointer to the
> > inode (so that we can remove an extent map from the rb tree) added to
> > the extent_map structure, so 24 bytes overhead.
> >
> > 2) All the maintenance of the list, adding to it, removing from it,
> > moving elements, etc, all requires locking and overhead from lock
> > contention (using the struct list_lru API or our own).
> >
> > So my proposal is to do this:
> >
> > 1) Keep using the rb tree to search for inodes - this is abstracted by
> > an helper function used in 2 other places.
> > 2) Remember the last scanned inode/root, and on every scan start after
> > that inode.
> > 3) Use ->nr_cached_objects() and calling ->free_cached_objects()
> > instead of registering a shrinker.
> >
>
> I'm still looking for a way to delete our rb_tree in the future, it causes us
> headaches with a lot of parallel file creates, but honestly other things are
> much worse than the rbtree right now.

I get it, I've thought sometime ago to replace it with an xarray or maple tree.
The rb tree gets really deep after 100k inodes, plus all the cache
line pulling every time navigating the tree for a search or insertion.

On the other hand it's embedded in the inode, and unlike when using an
xarray or maple tree, doesn't require extra memory allocations for
insertions and dealing with allocation failures.
I will give it a try sometime after finishing the extent map shrinker.
After those preparatory patches it's all hidden in a helper function,
making the switch to some other data structure easy.

Thanks.

>
> This is fine with me, we're not getting rid of it anytime soon, as long as we
> have a way to remember where we were then that's good enough for now and in
> keeping with all other existing reclaim implementations.  Thanks,
>
> Josef
diff mbox series

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c2d35b2062e..8bb295eaf3d7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1266,11 +1266,10 @@  static void free_global_roots(struct btrfs_fs_info *fs_info)
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	btrfs_unregister_extent_map_shrinker(fs_info);
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->ordered_bytes);
-	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
-	percpu_counter_destroy(&fs_info->evictable_extent_maps);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	btrfs_free_csum_hash(fs_info);
 	btrfs_free_stripe_hash_table(fs_info);
@@ -2846,11 +2845,11 @@  static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
 	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
 	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
 
-	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
+	ret = btrfs_register_extent_map_shrinker(fs_info);
 	if (ret)
 		return ret;
 
-	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	ret = percpu_counter_init(&fs_info->ordered_bytes, 0, GFP_KERNEL);
 	if (ret)
 		return ret;
 
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2fcf28148a81..9791acda5b57 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -8,6 +8,7 @@ 
 #include "extent_map.h"
 #include "compression.h"
 #include "btrfs_inode.h"
+#include "disk-io.h"
 
 
 static struct kmem_cache *extent_map_cache;
@@ -1026,3 +1027,158 @@  int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	free_extent_map(split_pre);
 	return ret;
 }
+
+static unsigned long btrfs_scan_inode(struct btrfs_inode *inode,
+				      unsigned long *scanned,
+				      unsigned long nr_to_scan)
+{
+	struct extent_map_tree *tree = &inode->extent_tree;
+	unsigned long nr_dropped = 0;
+	struct rb_node *node;
+
+	/*
+	 * Take the mmap lock so that we serialize with the inode logging phase
+	 * of fsync because we may need to set the full sync flag on the inode,
+	 * in case we have to remove extent maps in the tree's list of modified
+	 * extents. If we set the full sync flag in the inode while an fsync is
+	 * in progress, we may risk missing new extents because before the flag
+	 * is set, fsync decides to only wait for writeback to complete and then
+	 * during inode logging it sees the flag set and uses the subvolume tree
+	 * to find new extents, which may not be there yet because ordered
+	 * extents haven't completed yet.
+	 */
+	down_read(&inode->i_mmap_lock);
+	write_lock(&tree->lock);
+	node = rb_first_cached(&tree->map);
+	while (node) {
+		struct extent_map *em;
+
+		em = rb_entry(node, struct extent_map, rb_node);
+		node = rb_next(node);
+		(*scanned)++;
+
+		if (em->flags & EXTENT_FLAG_PINNED)
+			goto next;
+
+		if (!list_empty(&em->list))
+			btrfs_set_inode_full_sync(inode);
+
+		remove_extent_mapping(inode, em);
+		/* Drop the reference for the tree. */
+		free_extent_map(em);
+		nr_dropped++;
+next:
+		if (*scanned >= nr_to_scan)
+			break;
+
+		/*
+		 * Restart if we had to resched, and any extent maps that were
+		 * pinned before may have become unpinned after we released the
+		 * lock and took it again.
+		 */
+		if (cond_resched_rwlock_write(&tree->lock))
+			node = rb_first_cached(&tree->map);
+	}
+	write_unlock(&tree->lock);
+	up_read(&inode->i_mmap_lock);
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_scan_root(struct btrfs_root *root,
+				     unsigned long *scanned,
+				     unsigned long nr_to_scan)
+{
+	struct btrfs_inode *inode;
+	unsigned long nr_dropped = 0;
+	u64 min_ino = 0;
+
+	inode = btrfs_find_first_inode(root, min_ino);
+	while (inode) {
+		nr_dropped += btrfs_scan_inode(inode, scanned, nr_to_scan);
+
+		min_ino = btrfs_ino(inode) + 1;
+		iput(&inode->vfs_inode);
+
+		if (*scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+		inode = btrfs_find_first_inode(root, min_ino);
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_scan(struct shrinker *shrinker,
+					    struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+	unsigned long nr_dropped = 0;
+	unsigned long scanned = 0;
+	u64 next_root_id = 0;
+
+	while (scanned < sc->nr_to_scan) {
+		struct btrfs_root *root;
+		unsigned long count;
+
+		spin_lock(&fs_info->fs_roots_radix_lock);
+		count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+					       (void **)&root, next_root_id, 1);
+		if (count == 0) {
+			spin_unlock(&fs_info->fs_roots_radix_lock);
+			break;
+		}
+		next_root_id = btrfs_root_id(root) + 1;
+		root = btrfs_grab_root(root);
+		spin_unlock(&fs_info->fs_roots_radix_lock);
+
+		if (!root)
+			continue;
+
+		if (is_fstree(btrfs_root_id(root)))
+			nr_dropped += btrfs_scan_root(root, &scanned, sc->nr_to_scan);
+
+		btrfs_put_root(root);
+	}
+
+	return nr_dropped;
+}
+
+static unsigned long btrfs_extent_maps_count(struct shrinker *shrinker,
+					     struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = shrinker->private_data;
+
+	return percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
+}
+
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+
+	ret = percpu_counter_init(&fs_info->evictable_extent_maps, 0, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	fs_info->extent_map_shrinker = shrinker_alloc(0, "em-btrfs:%s", fs_info->sb->s_id);
+	if (!fs_info->extent_map_shrinker) {
+		percpu_counter_destroy(&fs_info->evictable_extent_maps);
+		return -ENOMEM;
+	}
+
+	fs_info->extent_map_shrinker->scan_objects = btrfs_extent_maps_scan;
+	fs_info->extent_map_shrinker->count_objects = btrfs_extent_maps_count;
+	fs_info->extent_map_shrinker->private_data = fs_info;
+
+	shrinker_register(fs_info->extent_map_shrinker);
+
+	return 0;
+}
+
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info)
+{
+	shrinker_free(fs_info->extent_map_shrinker);
+	ASSERT(percpu_counter_sum_positive(&fs_info->evictable_extent_maps) == 0);
+	percpu_counter_destroy(&fs_info->evictable_extent_maps);
+}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index c3707461ff62..8a6be2f7a0e2 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -140,5 +140,7 @@  void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
 int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 				   struct extent_map *new_em,
 				   bool modified);
+int btrfs_register_extent_map_shrinker(struct btrfs_fs_info *fs_info);
+void btrfs_unregister_extent_map_shrinker(struct btrfs_fs_info *fs_info);
 
 #endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 534d30dafe32..f1414814bd69 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -857,6 +857,8 @@  struct btrfs_fs_info {
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
 
+	struct shrinker *extent_map_shrinker;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;