mbox series

[00/17] btrfs: restrain lock extent usage during writeback

Message ID cover.1713363472.git.josef@toxicpanda.com (mailing list archive)
Headers show
Series btrfs: restrain lock extent usage during writeback | expand

Message

Josef Bacik April 17, 2024, 2:35 p.m. UTC
Hello,

We use the extent lock to cover a pretty wide range of operations during
writeback, and this has made things like the iomap conversion more annoying.
It's also relatively complex how the rules work, and in fact are different
between compressed and normal writeback.

This series is meant to address this by pushing the extent lock down to cover a
discreet set of operations.  Being able to clearly define what the extent lock
is protecting will give us a better idea of how to reduce it's usage or possibly
replace it in the future.

The current state of affairs has been we lock the page for the range we're going
to write back, and then we lock the extent, and then we start writeback.  This
looks a few different ways

COW:
lock page
  lock extent
    allocate the extent
    insert the pinned extent
    allocate the ordered extent
    clear DELALLOC
  unlock extent
unlock page

NOCOW
lock page
  lock extent
    lookup the file extent
      check if the file extent can be used nocow
        if no:
          cow_file_range
        if yes:
          insert pinned extent if prealloc
          allocate ordered extent
          clear delalloc
  unlock extent
unlock page

COMPRESS
lock page
  lock extent
  unlock extent
    create async objects for the page ranges
    compress the pages
    submit the compressed extents
      lock extent
        allocate the extent
        insert the pinned extent
        allocate the ordered extent
        clear DELALLOC
      unlock extent
      unlock pages

As you can see there's some variations above, but in essence the extent lock
protects

- Calling into the allocator.
- The em_tree modifications (pinned extent).
- Allocating and attaching the ordered extent.
- Clearing the various bits on the extent_io_tree.
- Looking up the file extent items on disk in the NOCOW case.

We don't actually need to protect the allocator when we're allocating, that has
it's own locking, we're not gaining anything from the lock being held there.

We also don't need to protect looking up the file extent items in the NOCOW
case, we are protected by the page lock here.  There are 3 cases we would need
to be worried about here

- Prealloc.  This locks the inode and calls btrfs_wait_ordered_range() before
  doing the preallocation, so there will be no dirty pages in this range before
  it begins the preallocation.
- Zero range.  Same as above.
- O_DIRECT.  We invalidate the page range before we start the write, so again
  we're protected here.

This allows us to reduce the scope of the extent lock to the 3 things that
really matter

- Protecting the em tree modifications.  This makes sense, this is directly
  related to the inode range.
- Protecting the extent_io_tree modifications.  Again this is clear.
- Allocating and attaching the ordered extent.  Also clear.

With this in mind we can push the extent locking down into the functions that do
these operations, and drastically simplify the locking that we're doing here.

I've run this through several runs of the CI to validate that this doesn't break
anything.  Nothing has failed yet, but it is a rather scary looking change, so a
sharp review is absolutely necessary, and having a decent soak time.  I would
like to get this merged ASAP so we can start pounding on it and make sure it's
solid before I attempt other locking related changes.  Thanks,

Josef
        

Josef Bacik (17):
  btrfs: handle errors in btrfs_reloc_clone_csums properly
  btrfs: push all inline logic into cow_file_range
  btrfs: unlock all the pages with successful inline extent creation
  btrfs: move extent bit and page cleanup into cow_file_range_inline
  btrfs: lock extent when doing inline extent in compression
  btrfs: push the extent lock into btrfs_run_delalloc_range
  btrfs: push extent lock into run_delalloc_nocow
  btrfs: adjust while loop condition in run_delalloc_nocow
  btrfs: push extent lock down in run_delalloc_nocow
  btrfs: remove unlock_extent from run_delalloc_compressed
  btrfs: push extent lock into run_delalloc_cow
  btrfs: push extent lock into cow_file_range
  btrfs: push lock_extent into cow_file_range_inline
  btrfs: move can_cow_file_range_inline() outside of the extent lock
  btrfs: push lock_extent down in cow_file_range()
  btrfs: push extent lock down in submit_one_async_extent
  btrfs: add a cached state to extent_clear_unlock_delalloc

 fs/btrfs/extent_io.c    |   8 +-
 fs/btrfs/extent_io.h    |   2 +
 fs/btrfs/inode.c        | 279 +++++++++++++++++++++++-----------------
 fs/btrfs/ordered-data.c |   6 +
 fs/btrfs/ordered-data.h |   1 +
 fs/btrfs/relocation.c   |   4 +-
 6 files changed, 177 insertions(+), 123 deletions(-)

Comments

Christoph Hellwig April 18, 2024, 5:54 a.m. UTC | #1
On Wed, Apr 17, 2024 at 10:35:44AM -0400, Josef Bacik wrote:
> discreet set of operations.  Being able to clearly define what the extent lock
> is protecting will give us a better idea of how to reduce it's usage or possibly
> replace it in the future.

It should also allow to stop taking it in ->read_folio and ->readahead,
which is what is making the btrfs I/O path so weird and incompatbile
with the rest of the kernel.  I tried to get rid of just that a while
ago but spectacularly failed.  Maybe doing this in smaller steps and
by someone knowning the code better is going to be more successful.
Goldwyn Rodrigues April 18, 2024, 1:45 p.m. UTC | #2
On 22:54 17/04, Christoph Hellwig wrote:
> On Wed, Apr 17, 2024 at 10:35:44AM -0400, Josef Bacik wrote:
> > discreet set of operations.  Being able to clearly define what the extent lock
> > is protecting will give us a better idea of how to reduce it's usage or possibly
> > replace it in the future.
> 
> It should also allow to stop taking it in ->read_folio and ->readahead,
> which is what is making the btrfs I/O path so weird and incompatbile
> with the rest of the kernel.  I tried to get rid of just that a while
> ago but spectacularly failed.  Maybe doing this in smaller steps and
> by someone knowning the code better is going to be more successful.
> 

The only reason I have encountered for taking extent locks during reads
is for checksums. read()s collects checksums before submitting the bio
where as writeback() adds the checksums during bio completion.

So, there is a small window where a read() performed immediately after
writeback+truncate pages would give an EIO because the checksum is
not in the checksum tree and does not match the calculated checksum.

If we can delay retrieving the checksum or wait for ordered extents to
complete before performing the read, I think avoiding extent locks
during read is possible.
Christoph Hellwig April 18, 2024, 2:47 p.m. UTC | #3
On Thu, Apr 18, 2024 at 08:45:35AM -0500, Goldwyn Rodrigues wrote:
> The only reason I have encountered for taking extent locks during reads
> is for checksums. read()s collects checksums before submitting the bio
> where as writeback() adds the checksums during bio completion.
> 
> So, there is a small window where a read() performed immediately after
> writeback+truncate pages would give an EIO because the checksum is
> not in the checksum tree and does not match the calculated checksum.
> 
> If we can delay retrieving the checksum or wait for ordered extents to
> complete before performing the read, I think avoiding extent locks
> during read is possible.

And the fix for that is to only clear the writeback bit once the
ordered extent processing has finished, which is the other bit
making btrfs I/O so different from the core kernels expectations.
It is highly coupled with the extent lock semantics as far as I
can tell.
Josef Bacik April 19, 2024, 6:54 p.m. UTC | #4
On Thu, Apr 18, 2024 at 07:47:26AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 18, 2024 at 08:45:35AM -0500, Goldwyn Rodrigues wrote:
> > The only reason I have encountered for taking extent locks during reads
> > is for checksums. read()s collects checksums before submitting the bio
> > where as writeback() adds the checksums during bio completion.
> > 
> > So, there is a small window where a read() performed immediately after
> > writeback+truncate pages would give an EIO because the checksum is
> > not in the checksum tree and does not match the calculated checksum.
> > 
> > If we can delay retrieving the checksum or wait for ordered extents to
> > complete before performing the read, I think avoiding extent locks
> > during read is possible.
> 
> And the fix for that is to only clear the writeback bit once the
> ordered extent processing has finished, which is the other bit
> making btrfs I/O so different from the core kernels expectations.
> It is highly coupled with the extent lock semantics as far as I
> can tell.
> 

They aren't coupled, but they are related.  My follow-up work is to make this
change and see what breaks, and then tackle the read side.  This should pave the
way for a straightforward iomap conversion.  Thanks,

Josef