mbox series

[00/21] Lock extents before pages

Message ID 20230302222506.14955-1-rgoldwyn@suse.de (mailing list archive)
Headers show
Series Lock extents before pages | expand

Message

Goldwyn Rodrigues March 2, 2023, 10:24 p.m. UTC
The main idea is that instead of handling extent locks per page, lock
the extents before handling the pages. This will help in calling iomap
functions from within extent locks, so all page/folio handling is
performed by iomap code.

The patch "debug extent locking" was added to debug this code to check
where the locks were initiated in case of deadlock. No need to add
this patch.

Changes since v1:
 - qgroup flush reversal, instead pass nowait=true and handle
   qgroup flush if qgroup reservations return EDQUOT
 - relocation - locking extents before pages
 - async code locking fix: async code deadlocked because parallel
   writebacks could set HAS_ASYNC_EXTENTS flag while other thread was
   performing a sync writeback. This required re-writing the async writeback
   path
 - incorporated lock range checks using WARN_ON(), though I must
   admit locking during truncate still looks ugly
 - added readahead_begin() for locking extents before calling readahead()
   since readahead is called with pages locked.

Goldwyn Rodrigues (21):
  fs: readahead_begin() to call before locking folio
  btrfs: add WARN_ON() on incorrect lock range
  btrfs: Add start < end check in btrfs_debug_check_extent_io_range()
  btrfs: make btrfs_qgroup_flush() non-static
  btrfs: Lock extents before pages for buffered write()
  btrfs: wait ordered range before locking during truncate
  btrfs: lock extents while truncating
  btrfs: no need to lock extent while performing invalidate_folio()
  btrfs: lock extents before folio for read()s
  btrfs: lock extents before pages in writepages
  btrfs: locking extents for async writeback
  btrfs: lock extents before pages - defrag
  btrfs: Perform memory faults under locked extent
  btrfs: writepage fixup lock rearrangement
  btrfs: lock extent before pages for encoded read ioctls
  btrfs: lock extent before pages in encoded write
  btrfs: btree_writepages lock extents before pages
  btrfs: check if writeback pages exist before starting writeback
  btrfs: lock extents before pages in relocation
  btrfs: Add inode->i_count instead of calling ihold()
  btrfs: debug extent locking

 fs/btrfs/compression.c       |   9 +-
 fs/btrfs/defrag.c            |  48 +---
 fs/btrfs/disk-io.c           |  28 ++-
 fs/btrfs/extent-io-tree.c    |  32 +--
 fs/btrfs/extent-io-tree.h    |  46 ++--
 fs/btrfs/extent_io.c         |  74 +-----
 fs/btrfs/extent_io.h         |   2 +
 fs/btrfs/extent_map.c        |   2 +-
 fs/btrfs/file.c              | 100 +++-----
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c             | 431 ++++++++++++++++++-----------------
 fs/btrfs/ordered-data.c      |   8 +-
 fs/btrfs/ordered-data.h      |   3 +-
 fs/btrfs/qgroup.c            |   6 +-
 fs/btrfs/qgroup.h            |   1 +
 fs/btrfs/relocation.c        |  44 ++--
 include/linux/fs.h           |   1 +
 include/trace/events/btrfs.h |  18 +-
 mm/readahead.c               |   3 +
 19 files changed, 397 insertions(+), 461 deletions(-)

Comments

David Sterba March 9, 2023, 6:10 p.m. UTC | #1
On Thu, Mar 02, 2023 at 04:24:45PM -0600, Goldwyn Rodrigues wrote:
> The main idea is that instead of handling extent locks per page, lock
> the extents before handling the pages. This will help in calling iomap
> functions from within extent locks, so all page/folio handling is
> performed by iomap code.
> 
> The patch "debug extent locking" was added to debug this code to check
> where the locks were initiated in case of deadlock. No need to add
> this patch.
> 
> Changes since v1:
>  - qgroup flush reversal, instead pass nowait=true and handle
>    qgroup flush if qgroup reservations return EDQUOT
>  - relocation - locking extents before pages
>  - async code locking fix: async code deadlocked because parallel
>    writebacks could set HAS_ASYNC_EXTENTS flag while other thread was
>    performing a sync writeback. This required re-writing the async writeback
>    path
>  - incorporated lock range checks using WARN_ON(), though I must
>    admit locking during truncate still looks ugly
>  - added readahead_begin() for locking extents before calling readahead()
>    since readahead is called with pages locked.
> 
> Goldwyn Rodrigues (21):
>   fs: readahead_begin() to call before locking folio
>   btrfs: add WARN_ON() on incorrect lock range
>   btrfs: Add start < end check in btrfs_debug_check_extent_io_range()
>   btrfs: make btrfs_qgroup_flush() non-static
>   btrfs: Lock extents before pages for buffered write()
>   btrfs: wait ordered range before locking during truncate
>   btrfs: lock extents while truncating
>   btrfs: no need to lock extent while performing invalidate_folio()
>   btrfs: lock extents before folio for read()s
>   btrfs: lock extents before pages in writepages
>   btrfs: locking extents for async writeback
>   btrfs: lock extents before pages - defrag
>   btrfs: Perform memory faults under locked extent
>   btrfs: writepage fixup lock rearrangement
>   btrfs: lock extent before pages for encoded read ioctls
>   btrfs: lock extent before pages in encoded write
>   btrfs: btree_writepages lock extents before pages
>   btrfs: check if writeback pages exist before starting writeback
>   btrfs: lock extents before pages in relocation
>   btrfs: Add inode->i_count instead of calling ihold()
>   btrfs: debug extent locking

I've picked the branch from your git repository (applies on top of
misc-next) and will add it to for-next. We may need to keep it there for
some time, with this kind of core change it's not possible to do reverts
in case we find serious problems.

Optimistic plan is to get it stabilized during current dev cycle and
then one more cycle for stabilization once it's in the master branch.

A general comment for all patches is that the changes need more
explanation regarding the effects of the reversed locking for the
particular context. With the longer time span for the merge we can add
it as we find bugs or need to clarify things.
Christoph Hellwig March 10, 2023, 7:50 a.m. UTC | #2
On Thu, Mar 09, 2023 at 07:10:51PM +0100, David Sterba wrote:
> I've picked the branch from your git repository (applies on top of
> misc-next) and will add it to for-next. We may need to keep it there for
> some time, with this kind of core change it's not possible to do reverts
> in case we find serious problems.

I am really worried about the i_count hack.  It is fundamentlly wrong
and breaks inode lifetime rules.  Furthermore there is no attempt
at understanding why it happens (or even if it is new with this series).

> 
> Optimistic plan is to get it stabilized during current dev cycle and
> then one more cycle for stabilization once it's in the master branch.

It is going to clash with quite a bit of other work going in this area
I fear.
Goldwyn Rodrigues March 10, 2023, 4:38 p.m. UTC | #3
On 23:50 09/03, Christoph Hellwig wrote:
> On Thu, Mar 09, 2023 at 07:10:51PM +0100, David Sterba wrote:
> > I've picked the branch from your git repository (applies on top of
> > misc-next) and will add it to for-next. We may need to keep it there for
> > some time, with this kind of core change it's not possible to do reverts
> > in case we find serious problems.
> 
> I am really worried about the i_count hack.  It is fundamentlly wrong
> and breaks inode lifetime rules.  Furthermore there is no attempt
> at understanding why it happens (or even if it is new with this series).

So this (another version) patch which first added working with i_count is:
8180ef8894fa ("Btrfs: keep inode pinned when compressing writes")

For async, inode is still referenced until all the pages
are written back and cleared in end_compressed_writeback().
evict() waits for all writeback to be completed. Josef, is this
correct?

I removed the ihold() and delayed iput sequence and shifted
unlock_extent() immediately after submit_compressed_extents(). It is
working fine with no crashes for compress tests. But since this is a
race condition I need to be sure if it is the correct thing to do. I
have updated the git [1] but it needs more testing.

In any case, there is another issue with scrubbing which I need to
resolve: btrfs/06[0-3]. I am on vacation next week and the earliest I
can get back to it is March 20.


> 
> > 
> > Optimistic plan is to get it stabilized during current dev cycle and
> > then one more cycle for stabilization once it's in the master branch.
> 
> It is going to clash with quite a bit of other work going in this area
> I fear.

[1] https://github.com/goldwynr/linux/tree/lock-rearrange
Christoph Hellwig March 13, 2023, 9:16 a.m. UTC | #4
On Fri, Mar 10, 2023 at 10:38:28AM -0600, Goldwyn Rodrigues wrote:
> > I am really worried about the i_count hack.  It is fundamentlly wrong
> > and breaks inode lifetime rules.  Furthermore there is no attempt
> > at understanding why it happens (or even if it is new with this series).
> 
> So this (another version) patch which first added working with i_count is:
> 8180ef8894fa ("Btrfs: keep inode pinned when compressing writes")
> 
> For async, inode is still referenced until all the pages
> are written back and cleared in end_compressed_writeback().
> evict() waits for all writeback to be completed. Josef, is this
> correct?
> 
> I removed the ihold() and delayed iput sequence and shifted
> unlock_extent() immediately after submit_compressed_extents(). It is
> working fine with no crashes for compress tests. But since this is a
> race condition I need to be sure if it is the correct thing to do. I
> have updated the git [1] but it needs more testing.

I think the import thing to remember here is:

 1) an inode is not getting fred as long as it has pages under writeback
 2) as long as you don't reference the inode after clearing the last
    PageWriteback bit, using the inode without an extra reference
    in the writeback code is fine

So I think we can just remove the extra referene after making sure that
the inode is not referenced after dropping the last writeback bit.
And handling of the writeback bit is quite funny in btrfs in a few
places.  Maybe I'll be able to find some time to dig into it while
you're away.
David Sterba March 15, 2023, 7:16 p.m. UTC | #5
On Thu, Mar 09, 2023 at 11:50:31PM -0800, Christoph Hellwig wrote:
> On Thu, Mar 09, 2023 at 07:10:51PM +0100, David Sterba wrote:
> > I've picked the branch from your git repository (applies on top of
> > misc-next) and will add it to for-next. We may need to keep it there for
> > some time, with this kind of core change it's not possible to do reverts
> > in case we find serious problems.
> 
> I am really worried about the i_count hack.  It is fundamentlly wrong
> and breaks inode lifetime rules.  Furthermore there is no attempt
> at understanding why it happens (or even if it is new with this series).

I don't think the i_count hack will stay until the final merge.

> > Optimistic plan is to get it stabilized during current dev cycle and
> > then one more cycle for stabilization once it's in the master branch.
> 
> It is going to clash with quite a bit of other work going in this area
> I fear.

Yeah so we'll need to decide order of merging. The locking change is
needed for iomap conversion and we need to do it at some point, knowing
that it's a risky change. Limiting amout of work in the same area is a
good thing also because of testing.