[v3,00/13] btrfs: support read-write for subpage metadata

Message ID: 20210325071445.90896-1-wqu@suse.com
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH v3 00/13] btrfs: support read-write for subpage metadata
Date: Thu, 25 Mar 2021 15:14:32 +0800

Qu Wenruo March 25, 2021, 7:14 a.m. UTC

This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patches from
the github repo and run the fstests generic group.

At least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
  Since current defrag works per page, supporting subpage defrag needs
  some changes in the loop.
  E.g. if a page has both hole and regular extents in it, then defrag
  will rewrite the full 64K page.

  Thus for now, defrag related failures are expected.
  But this should only cause a behavior difference; no crash or hang is
  expected.

- No compression support yet
  There are at least 2 known bugs when forcing compression for subpage:
  * Some hard coded PAGE_SIZE usage screwing up space reservation
  * Subpage ASSERT() triggered
    This is because some compression code is unlocking locked_page by
    calling extent_clear_unlock_delalloc() with locked_page == NULL.
  So for now compression is also disabled.

- Inode nbytes mismatch
  Still debugging.
  The fastest way to trigger is fsx using the following parameters:

    fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

  This causes inode nbytes to differ from the expected value and
  triggers a btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
  Now metadata read/write relies on extent io tree locking, rather than
  page locking.
  This is to allow behaviors like read locking one eb while also trying
  to read lock another eb in the same page.
  We can't rely on the page lock as now we have multiple extent buffers
  in the same page.

- Page status update
  Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
  Instead of just grabbing extent buffer from page::private, we need to
  iterate all dirty extent buffers in the page and submit them.
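
The page status and submission differences above can be modelled in
userspace roughly as follows. This is a minimal sketch, not the code from
the series: the struct, the helper names, and the 64K page / 4K sector /
16K nodesize geometry are all assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 64K page, 4K sectors, 16K tree blocks (nodesize). */
#define PAGE_SZ   65536u
#define SECTOR_SZ 4096u
#define NODE_SZ   16384u

/* Hypothetical per-page state: one dirty bit per sector (16 per page). */
struct subpage_state {
	uint16_t dirty_bitmap;
};

/* Build a mask covering the sectors of [offset, offset + len) in the page. */
static uint16_t range_mask(uint32_t offset, uint32_t len)
{
	uint32_t first = offset / SECTOR_SZ;
	uint32_t nbits = len / SECTOR_SZ;

	return (uint16_t)(((1u << nbits) - 1) << first);
}

static void subpage_set_dirty(struct subpage_state *s,
			      uint32_t offset, uint32_t len)
{
	s->dirty_bitmap |= range_mask(offset, len);
}

/*
 * Clear the range. Returns true when no dirty bit is left, i.e. when the
 * page-level dirty flag could be dropped as well.
 */
static bool subpage_clear_dirty(struct subpage_state *s,
				uint32_t offset, uint32_t len)
{
	s->dirty_bitmap &= (uint16_t)~range_mask(offset, len);
	return s->dirty_bitmap == 0;
}

/* True only if every sector in the range is dirty. */
static bool subpage_test_dirty(const struct subpage_state *s,
			       uint32_t offset, uint32_t len)
{
	uint16_t mask = range_mask(offset, len);

	return (s->dirty_bitmap & mask) == mask;
}

/*
 * Instead of taking one extent buffer from page::private, walk the page in
 * nodesize steps and record the offset of every dirty tree block.
 * `out` must have room for PAGE_SZ / NODE_SZ entries.
 */
static int collect_dirty_ebs(const struct subpage_state *s, uint32_t *out)
{
	int n = 0;

	for (uint32_t off = 0; off < PAGE_SZ; off += NODE_SZ)
		if (subpage_test_dirty(s, off, NODE_SZ))
			out[n++] = off;
	return n;
}
```

With this layout a single 64K page can hold up to four tree blocks, which
is why one page-level flag is no longer enough to describe its state.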

[CHANGELOG]
v2:
- Rebased to latest misc-next
  No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
  This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming scheme for each patch
  No more "extent_io:" or "inode:" prefixes.

- Move two pure cleanups to the series
  Patch 2~3, originally in RW part.

- Fix one uninitialized variable
  Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
  This removes 2 cleanup patches.

- Add new overview comment for subpage metadata
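
As an illustration of how a tool such as mkfs.btrfs might consume the new
sysfs attribute, here is a small sketch. It assumes the attribute contents
are a whitespace-separated list of decimal sector sizes that the caller has
already read into a string; that format and the helper name are assumptions,
not something stated in this cover letter.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if `sectorsize` appears in `list`, a whitespace-separated
 * string of decimal sizes (assumed format of supported_sectorsizes).
 */
static bool sectorsize_supported(const char *list, unsigned long sectorsize)
{
	const char *p = list;

	while (*p) {
		char *end;
		unsigned long val = strtoul(p, &end, 10);

		if (end == p) {
			/* No number here (e.g. trailing newline): skip a char. */
			p++;
			continue;
		}
		if (val == sectorsize)
			return true;
		p = end;
	}
	return false;
}
```

A caller could then warn only when the requested sector size is missing
from the list, instead of whenever it differs from the page size.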

Qu Wenruo (13):
  btrfs: add sysfs interface for supported sectorsize
  btrfs: use min() to replace open-code in btrfs_invalidatepage()
  btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
  btrfs: refactor how we iterate ordered extent in
    btrfs_invalidatepage()
  btrfs: introduce helpers for subpage dirty status
  btrfs: introduce helpers for subpage writeback status
  btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
    metadata
  btrfs: support subpage metadata csum calculation at write time
  btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  btrfs: make the page uptodate assert to be subpage compatible
  btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
  btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
    compatible
  btrfs: add subpage overview comments

 fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
 fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
 fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
 fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h   |  17 +++++
 fs/btrfs/sysfs.c     |  15 +++++
 6 files changed, 441 insertions(+), 116 deletions(-)

Neal Gompa March 25, 2021, 12:20 p.m. UTC | #1

Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.

Qu Wenruo March 25, 2021, 1:16 p.m. UTC | #2

On 2021/3/25 8:20 PM, Neal Gompa wrote:
> Why wouldn't we just integrate full read-write support with the
> caveats as described now? It seems to be relatively reasonable to do
> that, and this patch set is essentially unusable without the rest of
> it that does enable full read-write support.

The metadata part is much more stable than the data path (almost untouched
for several months), and the metadata part already has some differences
in its behavior, which need review.

Your point makes some sense, but I still don't believe pushing a super
large patchset helps the review at all.

If you want to test, you can grab the branch from the github repo.
If you want to review, the mails are all here for review.

In fact, we used to have subpage support sent as one big patchset by the
IBM guys, but as a result only some preparation patches got merged, and
nothing more.

Using this multi-series method, we're already doing better work and
have received more testing (to ensure regular sectorsize is not affected
at least).

Thanks,
Qu

Ritesh Harjani (IBM) March 28, 2021, 8:02 p.m. UTC | #3

Hi Qu Wenruo,

Sorry about chiming in late on this. I don't have any strong objection to either
approach. Although some time back when I tested your RW support git tree on
Power, the unmount patch itself was crashing. I didn't debug it at that time
(this was a month back or so), so I also didn't bother running xfstests on Power.

But we do have an interest in making sure this patch series works for bs < ps
on the Power platform. I can try helping with testing, reviewing (to the best of my
knowledge) and fixing anything if possible :)

Let me try and pull your tree and test it on Power. Please let me know if there
is anything that needs to be taken care of apart from your github tree and
btrfs-progs branch with bs < ps support.

-ritesh

Qu Wenruo March 29, 2021, 2:01 a.m. UTC | #4

On 2021/3/29 4:02 AM, Ritesh Harjani wrote:
> Hi Qu Wenruo,
>
> Sorry about chiming in late on this. I don't have any strong objection to either
> approach. Although some time back when I tested your RW support git tree on
> Power, the unmount patch itself was crashing. I didn't debug it at that time
> (this was a month back or so), so I also didn't bother running xfstests on Power.
>
> But we do have an interest in making sure this patch series works for bs < ps
> on the Power platform. I can try helping with testing, reviewing (to the best of my
> knowledge) and fixing anything if possible :)

That's great!

One of my biggest problems here is that I don't have a good enough testing
environment.

Although SUSE has internal clouds for ARM64/PPC64, due to the
f**king Great Firewall, it's super slow to access, not to mention doing
proper debugging.

Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the tests.
But their computing power is far from ideal; only generic/quick can
finish in hours.

Thus real world Power could definitely help.
>
> Let me try and pull your tree and test it on Power. Please let me know if there
> is anything that needs to be taken care of apart from your github tree and
> btrfs-progs branch with bs < ps support.

If you're going to test the branch, here are some small notes:

- Need to use latest btrfs-progs
   As it fixes a false alert on crossing 64K page boundary.

- Need to slightly modify btrfs-progs to avoid false alerts
   For the subpage case, mkfs.btrfs will output a warning, but that warning
   goes to stderr, which will screw up the generic test group.
   It's recommended to apply the following diff:

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9..21976554 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
                 return -EINVAL;
         }
         if (page_size != sectorsize)
-               warning(
-"the filesystem may not be mountable, sectorsize %u doesn't match page size %u",
+               printf(
+"the filesystem may not be mountable, sectorsize %u doesn't match page size %u\n",
                         sectorsize, page_size);
         return 0;
  }

- Xfstest/btrfs group will crash at btrfs/143
   Still investigating, but you can ignore btrfs group for now.

- Very rare hang
   There is a very low chance to hang, with "bad ordered accounting"
   in dmesg.
   If you can hit it, please let me know.
   I have some ideas to fix it, but they are not yet in the branch.

- btrfs inode nbytes mismatch
   Investigating, as it makes btrfs-check report an error.

The last two bugs are the final showstoppers; I'll give you extra
updates when those are fixed.

Thanks,
Qu

David Sterba March 29, 2021, 6:53 p.m. UTC | #5

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> v3:
> - Rename the sysfs to supported_sectorsizes
> 
> - Rebased to latest misc-next branch
>   This removes 2 cleanup patches.
> 
> - Add new overview comment for subpage metadata

V3 is now in for-next, targeting merge for 5.13. Please post any fixups
as replies to the individual patches, I'll fold them in, rather than a
full series resend. Thanks.

Qu Wenruo April 1, 2021, 5:36 a.m. UTC | #6

On 2021/3/30 上午2:53, David Sterba wrote:
> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>> v3:
>> - Rename the sysfs to supported_sectorsizes
>>
>> - Rebased to latest misc-next branch
>>    This removes 2 cleanup patches.
>>
>> - Add new overview comment for subpage metadata
>
> V3 is now in for-next, targeting merge for 5.13. Please post any fixups
> as replies to the individual patches, I'll fold them in, rather a full
> series resend. Thanks.
>
Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
iterate ordered extent in btrfs_invalidatepage()"?

Since in the series, there are no other patches touching it, dropping it
should not involve too much hassle.

The problem here is, how we handle ordered extent really belongs to the
data write path.

Furthermore, after all the data RW related testing, it turns out that
the ordered extent code has several problems:

- Separate indicators for ordered extent
   We use PagePrivate2 to indicate whether we have pending ordered extent
   io.
   But it is neither properly integrated into the ordered extent code nor
   properly documented.

- Complex call sites requirement
   For endio we don't care whether we finished the ordered extent, while
   for invalidatepage, we don't really need to bother if we finished all
   the ordered extents in the range.

   Thus we really don't need to bother who finished the ordered extents,
   but just want to mark the io finished for the range.

- Lack of subpage compatibility
   That's why I'm here complaining, especially due to the PagePrivate2
   usage.
   It needs to be converted to a new bitmap.
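
To illustrate the last point, here is a minimal userspace sketch of what
a per-sector "ordered" bitmap could look like. It assumes a 64K page and
4K sectors; the struct and helper names (subpage_ordered, range_mask,
set_ordered, clear_ordered, test_ordered) are hypothetical and are not
the kernel's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ   65536u  /* assumed 64K page size */
#define SECTOR_SZ 4096u   /* assumed 4K sector size */

/* Hypothetical per-page state: one "ordered" bit per sector. */
struct subpage_ordered {
	uint16_t bitmap;  /* 64K / 4K = 16 sectors per page */
};

/* Mask covering the sectors of [start, start + len) within the page. */
static uint16_t range_mask(uint32_t start, uint32_t len)
{
	uint32_t first = start / SECTOR_SZ;
	uint32_t nr = len / SECTOR_SZ;

	/* The range must be sector aligned and stay inside the page. */
	assert(start % SECTOR_SZ == 0 && len % SECTOR_SZ == 0);
	assert(len > 0 && start + len <= PAGE_SZ);
	return (uint16_t)(((1u << nr) - 1u) << first);
}

/* Mark the sectors of the range as having pending ordered IO. */
static void set_ordered(struct subpage_ordered *sp, uint32_t start, uint32_t len)
{
	sp->bitmap |= range_mask(start, len);
}

/* Mark the sectors of the range as finished. */
static void clear_ordered(struct subpage_ordered *sp, uint32_t start, uint32_t len)
{
	sp->bitmap &= (uint16_t)~range_mask(start, len);
}

/* True if any sector in the range still has pending ordered IO. */
static bool test_ordered(const struct subpage_ordered *sp, uint32_t start, uint32_t len)
{
	return (sp->bitmap & range_mask(start, len)) != 0;
}
```

With a single page-wide bit, setting the flag for [16K, 32K) would also
make [0, 16K) look pending; with the bitmap, the two ranges can be
tracked independently.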

There will be a refactor of the btrfs_dec_test_*_ordered_pending()
functions soon, and obviously the existing call sites will all be gone.

Thus that fourth patch makes no sense.

If needed, I can resend the patchset without that patch.

Thanks,
Qu

David Sterba April 1, 2021, 5:55 p.m. UTC | #7

On Thu, Apr 01, 2021 at 01:36:56PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/30 上午2:53, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> v3:
> >> - Rename the sysfs to supported_sectorsizes
> >>
> >> - Rebased to latest misc-next branch
> >>    This removes 2 cleanup patches.
> >>
> >> - Add new overview comment for subpage metadata
> >
> > V3 is now in for-next, targeting merge for 5.13. Please post any fixups
> > as replies to the individual patches, I'll fold them in, rather a full
> > series resend. Thanks.
> >
> Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
> iterate ordered extent in btrfs_invalidatepage()"?

Dropped, there were no conflicts in the followup patches.

> Since in the series, there are no other patches touching it, dropping it
> should not involve too much hassle.
> 
> The problem here is, how we handle ordered extent really belongs to the
> data write path.
> 
> Furthermore, after all the data RW related testing, it turns out that
> the ordered extent code has several problems:
> 
> - Separate indicators for ordered extent
>    We use PagePriavte2 to indicate whether we have pending ordered extent
>    io.
>    But it is not properly integrated into ordered extent code, nor really
>    properly documented.
> 
> - Complex call sites requirement
>    For endio we don't care whether we finished the ordered extent, while
>    for invalidatepage, we don't really need to bother if we finished all
>    the ordered extents in the range.
> 
>    Thus we really don't need to bother who finished the ordered extents,
>    but just want to mark the io finished for the range.
> 
> - Lack subpage compatibility
>    That's why I'm here complaining, especially due to the PagePrivate2
>    usage.
>    It needs to be converted to a new bitmap.
> 
> There will be a refactor on the btrfs_dec_test_*_ordered_pending()
> functions soon, and obvious the existing call sites will all be gone.
> 
> Thus that fourth patch makes no sense.

Ok, thanks for the explanation.

Anand Jain April 2, 2021, 1:27 a.m. UTC | #8

On 01/04/2021 13:36, Qu Wenruo wrote:
> 
> 
> On 2021/3/30 上午2:53, David Sterba wrote:
>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>> v3:
>>> - Rename the sysfs to supported_sectorsizes
>>>
>>> - Rebased to latest misc-next branch
>>>    This removes 2 cleanup patches.
>>>
>>> - Add new overview comment for subpage metadata
>>
>> V3 is now in for-next, targeting merge for 5.13. Please post any fixups
>> as replies to the individual patches, I'll fold them in, rather a full
>> series resend. Thanks.
>>
> Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
> iterate ordered extent in btrfs_invalidatepage()"?
> 


  Oh. Just saw this. You may ignore my questions there.

Thanks, Anand


> Since in the series, there are no other patches touching it, dropping it
> should not involve too much hassle.
> 
> The problem here is, how we handle ordered extent really belongs to the
> data write path.
> 
> Furthermore, after all the data RW related testing, it turns out that
> the ordered extent code has several problems:
> 
> - Separate indicators for ordered extent
>   We use PagePriavte2 to indicate whether we have pending ordered extent
>   io.
>   But it is not properly integrated into ordered extent code, nor really
>   properly documented.
> 
> - Complex call sites requirement
>   For endio we don't care whether we finished the ordered extent, while
>   for invalidatepage, we don't really need to bother if we finished all
>   the ordered extents in the range.
> 
>   Thus we really don't need to bother who finished the ordered extents,
>   but just want to mark the io finished for the range.
> 
> - Lack subpage compatibility
>   That's why I'm here complaining, especially due to the PagePrivate2
>   usage.
>   It needs to be converted to a new bitmap.
> 
> There will be a refactor on the btrfs_dec_test_*_ordered_pending()
> functions soon, and obvious the existing call sites will all be gone.
> 
> Thus that fourth patch makes no sense.
> 
> If needed, I can resend the patchset without that patch.
> 
> Thanks,
> Qu

Anand Jain April 2, 2021, 1:39 a.m. UTC | #9

On 29/03/2021 10:01, Qu Wenruo wrote:
> 
> 
> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>
>>>>> This patchset can be fetched from the following github repo, along 
>>>>> with
>>>>> the full subpage RW support:
>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>
>>>>> This patchset is for metadata read write support.
>>>>>
>>>>> [FULL RW TEST]
>>>>> Since the data write path is not included in this patchset, we can't
>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>> github repo and do fstests/generic tests.
>>>>>
>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>> for now.
>>>>>
>>>>> There are some known issues:
>>>>>
>>>>> - Defrag behavior change
>>>>>     Since current defrag is doing per-page defrag, to support subpage
>>>>>     defrag, we need some change in the loop.
>>>>>     E.g. if a page has both hole and regular extents in it, then 
>>>>> defrag
>>>>>     will rewrite the full 64K page.
>>>>>
>>>>>     Thus for now, defrag related failure is expected.
>>>>>     But this should only cause behavior difference, no crash nor
>>>>>     hang is expected.
>>>>>
>>>>> - No compression support yet
>>>>>     There are at least 2 known bugs if forcing compression for subpage
>>>>>     * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>     * Subpage ASSERT() triggered
>>>>>       This is because some compression code is unlocking 
>>>>> locked_page by
>>>>>       calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>>     So for now compression is also disabled.
>>>>>
>>>>> - Inode nbytes mismatch
>>>>>     Still debugging.
>>>>>     The fastest way to trigger is fsx using the following parameters:
>>>>>
>>>>>       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > 
>>>>> /tmp/fsx
>>>>>
>>>>>     Which would cause inode nbytes differs from expected value and
>>>>>     triggers btrfs check error.
>>>>>
>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>> The metadata part in fact has more new code than data part, as it has
>>>>> some different behaviors compared to the regular sector size handling:
>>>>>
>>>>> - No more page locking
>>>>>     Now metadata read/write relies on extent io tree locking, other 
>>>>> than
>>>>>     page locking.
>>>>>     This is to allow behaviors like read lock one eb while also try to
>>>>>     read lock another eb in the same page.
>>>>>     We can't rely on page lock as now we have multiple extent
>>>>>     buffers in the same page.
>>>>>
>>>>> - Page status update
>>>>>     Now we use subpage wrappers to handle page status update.
>>>>>
>>>>> - How to submit dirty extent buffers
>>>>>     Instead of just grabbing extent buffer from page::private, we 
>>>>> need to
>>>>>     iterate all dirty extent buffers in the page and submit them.
>>>>>
>>>>> [CHANGELOG]
>>>>> v2:
>>>>> - Rebased to latest misc-next
>>>>>     No conflicts at all.
>>>>>
>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>     This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>
>>>>> - Use newer naming schema for each patch
>>>>>     No more "extent_io:" or "inode:" schema anymore.
>>>>>
>>>>> - Move two pure cleanups to the series
>>>>>     Patch 2~3, originally in RW part.
>>>>>
>>>>> - Fix one uninitialized variable
>>>>>     Patch 6.
>>>>>
>>>>> v3:
>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>
>>>>> - Rebased to latest misc-next branch
>>>>>     This removes 2 cleanup patches.
>>>>>
>>>>> - Add new overview comment for subpage metadata
>>>>>
>>>>> Qu Wenruo (13):
>>>>>     btrfs: add sysfs interface for supported sectorsize
>>>>>     btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>     btrfs: remove unnecessary variable shadowing in 
>>>>> btrfs_invalidatepage()
>>>>>     btrfs: refactor how we iterate ordered extent in
>>>>>       btrfs_invalidatepage()
>>>>>     btrfs: introduce helpers for subpage dirty status
>>>>>     btrfs: introduce helpers for subpage writeback status
>>>>>     btrfs: allow btree_set_page_dirty() to do more sanity check on 
>>>>> subpage
>>>>>       metadata
>>>>>     btrfs: support subpage metadata csum calculation at write time
>>>>>     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>     btrfs: make the page uptodate assert to be subpage compatible
>>>>>     btrfs: make set/clear_extent_buffer_dirty() to be subpage 
>>>>> compatible
>>>>>     btrfs: make set_btree_ioerr() accept extent buffer and to be 
>>>>> subpage
>>>>>       compatible
>>>>>     btrfs: add subpage overview comments
>>>>>
>>>>>    fs/btrfs/disk-io.c   | 143 
>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>    fs/btrfs/subpage.h   |  17 +++++
>>>>>    fs/btrfs/sysfs.c     |  15 +++++
>>>>>    6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>
>>>>> -- 
>>>>> 2.30.1
>>>>>
>>>>
>>>> Why wouldn't we just integrate full read-write support with the
>>>> caveats as described now? It seems to be relatively reasonable to do
>>>> that, and this patch set is essentially unusable without the rest of
>>>> it that does enable full read-write support.
>>>
>>> The metadata part is much more stable than data path (almost not touched
>>> for several months), and the metadata part already has some difference
>>> in its behavior, which needs review.
>>>
>>> You point makes some sense, but I still don't believe pushing a super
>>> large patchset does any help for the review.
>>>
>>> If you want to test, you can grab the branch from the github repo.
>>> If you want to review, the mails are all here for review.
>>>
>>> In fact, we used to have subpage support sent as a big patchset from IBM
>>> guys, but the result is only some preparation patches get merged, and
>>> nothing more.
>>>
>>> Using this multi-series method, we're already doing better work and
>>> received more testing (to ensure regular sectorsize is not affected at
>>> least).
>>
>> Hi Qu Wenruo,
>>
>> Sorry about chiming in late on this. I don't have any strong objection 
>> on either
>> approach. Although sometime back when I tested your RW support git 
>> tree on
>> Power, the unmount patch itself was crashing. I didn't debug it that time
>> (this was a month back or so), so I also didn't bother testing 
>> xfstests on Power.
>>
>> But we do have an interest in making sure this patch series work on bs 
>> <ps
>> on Power platform. I can try helping with testing, reviewing (to best
>> of my knowledge) and fixing anything is possible :)
> 
> That's great!
> 
> One of my biggest problem here is, I don't have good enough testing
> environment.
> 
> Although SUSE has internal clouds for ARM64/PPC64, but due to the
> f**king Great Firewall, it's super slow to access, no to mention doing
> proper debugging.
> 
> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
> But their computing power is far from ideal, only generic/quick can
> finish in hours.
> 
> Thus real world Power could definitely help.
>>
>> Let me try and pull your tree and test it on Power. Please let me know 
>> if there
>> is anything needs to be taken care apart from your github tree and 
>> btrfs-progs
>> branch with bs < ps support.
> 
> If you're going to test the branch, here are some small notes:
> 
> - Need to use latest btrfs-progs
>   As it fixes a false alert on crossing 64K page boundary.
> 
> - Need to slightly modify btrfs-progs to avoid false alerts
>   For subpage case, mkfs.btrfs will output a warning, but that warning
>   is outputted into stderr, which will screw up generic test groups.
>   It's recommended to apply the following diff:
> 
> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> index 569208a9..21976554 100644
> --- a/common/fsfeatures.c
> +++ b/common/fsfeatures.c
> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>                 return -EINVAL;
>         }
>         if (page_size != sectorsize)
> -               warning(
> -"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u",
> +               printf(
> +"the filesystem may not be mountable, sectorsize %u doesn't match page
> size %u\n",
>                         sectorsize, page_size);
>         return 0;
> }
> 


> - Xfstest/btrfs group will crash at btrfs/143
>   Still investigating, but you can ignore btrfs group for now.
> 
> - Very rare hang
>   There is a very low change to hang, with "bad ordered accounting"
>   dmesg.
>   If you can hit, please let me know.
>   I had something idea to fix it, but not yet in the branch.
> 
> - btrfs inode nbytes mismatch
>   Investigating, as it will make btrfs-check to report error.
> 
> The last two bugs are the final show blocker, I'll give you extra
> updates when those are fixed.

  I am running the tests on aarch64 here. Are fixes for these known
  issues posted in the ML? I can't see them yet.

Thanks, Anand


> Thanks,
> Qu
> 
>>
>> -ritesh
>>
>>

Qu Wenruo April 2, 2021, 3:26 a.m. UTC | #10

On 2021/4/2 上午9:39, Anand Jain wrote:
> On 29/03/2021 10:01, Qu Wenruo wrote:
>>
>>
>> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/3/25 下午8:20, Neal Gompa wrote:
>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>
>>>>>> This patchset can be fetched from the following github repo, along
>>>>>> with
>>>>>> the full subpage RW support:
>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>
>>>>>> This patchset is for metadata read write support.
>>>>>>
>>>>>> [FULL RW TEST]
>>>>>> Since the data write path is not included in this patchset, we can't
>>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>>> github repo and do fstests/generic tests.
>>>>>>
>>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>>> for now.
>>>>>>
>>>>>> There are some known issues:
>>>>>>
>>>>>> - Defrag behavior change
>>>>>>     Since current defrag is doing per-page defrag, to support subpage
>>>>>>     defrag, we need some change in the loop.
>>>>>>     E.g. if a page has both hole and regular extents in it, then
>>>>>> defrag
>>>>>>     will rewrite the full 64K page.
>>>>>>
>>>>>>     Thus for now, defrag related failure is expected.
>>>>>>     But this should only cause behavior difference, no crash nor
>>>>>>     hang is expected.
>>>>>>
>>>>>> - No compression support yet
>>>>>>     There are at least 2 known bugs if forcing compression for
>>>>>> subpage
>>>>>>     * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>     * Subpage ASSERT() triggered
>>>>>>       This is because some compression code is unlocking
>>>>>> locked_page by
>>>>>>       calling extent_clear_unlock_delalloc() with locked_page ==
>>>>>> NULL.
>>>>>>     So for now compression is also disabled.
>>>>>>
>>>>>> - Inode nbytes mismatch
>>>>>>     Still debugging.
>>>>>>     The fastest way to trigger is fsx using the following parameters:
>>>>>>
>>>>>>       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file >
>>>>>> /tmp/fsx
>>>>>>
>>>>>>     Which would cause inode nbytes differs from expected value and
>>>>>>     triggers btrfs check error.
>>>>>>
>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>> The metadata part in fact has more new code than data part, as it has
>>>>>> some different behaviors compared to the regular sector size
>>>>>> handling:
>>>>>>
>>>>>> - No more page locking
>>>>>>     Now metadata read/write relies on extent io tree locking,
>>>>>> other than
>>>>>>     page locking.
>>>>>>     This is to allow behaviors like read lock one eb while also
>>>>>> try to
>>>>>>     read lock another eb in the same page.
>>>>>>     We can't rely on page lock as now we have multiple extent
>>>>>>     buffers in the same page.
>>>>>>
>>>>>> - Page status update
>>>>>>     Now we use subpage wrappers to handle page status update.
>>>>>>
>>>>>> - How to submit dirty extent buffers
>>>>>>     Instead of just grabbing extent buffer from page::private, we
>>>>>> need to
>>>>>>     iterate all dirty extent buffers in the page and submit them.
>>>>>>
>>>>>> [CHANGELOG]
>>>>>> v2:
>>>>>> - Rebased to latest misc-next
>>>>>>     No conflicts at all.
>>>>>>
>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>     This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>
>>>>>> - Use newer naming schema for each patch
>>>>>>     No more "extent_io:" or "inode:" schema anymore.
>>>>>>
>>>>>> - Move two pure cleanups to the series
>>>>>>     Patch 2~3, originally in RW part.
>>>>>>
>>>>>> - Fix one uninitialized variable
>>>>>>     Patch 6.
>>>>>>
>>>>>> v3:
>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>
>>>>>> - Rebased to latest misc-next branch
>>>>>>     This removes 2 cleanup patches.
>>>>>>
>>>>>> - Add new overview comment for subpage metadata
>>>>>>
>>>>>> Qu Wenruo (13):
>>>>>>     btrfs: add sysfs interface for supported sectorsize
>>>>>>     btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>>     btrfs: remove unnecessary variable shadowing in
>>>>>> btrfs_invalidatepage()
>>>>>>     btrfs: refactor how we iterate ordered extent in
>>>>>>       btrfs_invalidatepage()
>>>>>>     btrfs: introduce helpers for subpage dirty status
>>>>>>     btrfs: introduce helpers for subpage writeback status
>>>>>>     btrfs: allow btree_set_page_dirty() to do more sanity check on
>>>>>> subpage
>>>>>>       metadata
>>>>>>     btrfs: support subpage metadata csum calculation at write time
>>>>>>     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>     btrfs: make the page uptodate assert to be subpage compatible
>>>>>>     btrfs: make set/clear_extent_buffer_dirty() to be subpage
>>>>>> compatible
>>>>>>     btrfs: make set_btree_ioerr() accept extent buffer and to be
>>>>>> subpage
>>>>>>       compatible
>>>>>>     btrfs: add subpage overview comments
>>>>>>
>>>>>>    fs/btrfs/disk-io.c   | 143
>>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>>    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>>    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>>    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>>    fs/btrfs/subpage.h   |  17 +++++
>>>>>>    fs/btrfs/sysfs.c     |  15 +++++
>>>>>>    6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.30.1
>>>>>>
>>>>>
>>>>> Why wouldn't we just integrate full read-write support with the
>>>>> caveats as described now? It seems to be relatively reasonable to do
>>>>> that, and this patch set is essentially unusable without the rest of
>>>>> it that does enable full read-write support.
>>>>
>>>> The metadata part is much more stable than data path (almost not
>>>> touched
>>>> for several months), and the metadata part already has some difference
>>>> in its behavior, which needs review.
>>>>
>>>> You point makes some sense, but I still don't believe pushing a super
>>>> large patchset does any help for the review.
>>>>
>>>> If you want to test, you can grab the branch from the github repo.
>>>> If you want to review, the mails are all here for review.
>>>>
>>>> In fact, we used to have subpage support sent as a big patchset from
>>>> IBM
>>>> guys, but the result is only some preparation patches get merged, and
>>>> nothing more.
>>>>
>>>> Using this multi-series method, we're already doing better work and
>>>> received more testing (to ensure regular sectorsize is not affected at
>>>> least).
>>>
>>> Hi Qu Wenruo,
>>>
>>> Sorry about chiming in late on this. I don't have any strong
>>> objection on either
>>> approach. Although sometime back when I tested your RW support git
>>> tree on
>>> Power, the unmount patch itself was crashing. I didn't debug it that
>>> time
>>> (this was a month back or so), so I also didn't bother testing
>>> xfstests on Power.
>>>
>>> But we do have an interest in making sure this patch series work on
>>> bs <ps
>>> on Power platform. I can try helping with testing, reviewing (to best
>>> of my knowledge) and fixing anything is possible :)
>>
>> That's great!
>>
>> One of my biggest problem here is, I don't have good enough testing
>> environment.
>>
>> Although SUSE has internal clouds for ARM64/PPC64, but due to the
>> f**king Great Firewall, it's super slow to access, no to mention doing
>> proper debugging.
>>
>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
>> But their computing power is far from ideal, only generic/quick can
>> finish in hours.
>>
>> Thus real world Power could definitely help.
>>>
>>> Let me try and pull your tree and test it on Power. Please let me
>>> know if there
>>> is anything needs to be taken care apart from your github tree and
>>> btrfs-progs
>>> branch with bs < ps support.
>>
>> If you're going to test the branch, here are some small notes:
>>
>> - Need to use latest btrfs-progs
>>   As it fixes a false alert on crossing 64K page boundary.
>>
>> - Need to slightly modify btrfs-progs to avoid false alerts
>>   For subpage case, mkfs.btrfs will output a warning, but that warning
>>   is outputted into stderr, which will screw up generic test groups.
>>   It's recommended to apply the following diff:
>>
>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>> index 569208a9..21976554 100644
>> --- a/common/fsfeatures.c
>> +++ b/common/fsfeatures.c
>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>                 return -EINVAL;
>>         }
>>         if (page_size != sectorsize)
>> -               warning(
>> -"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u",
>> +               printf(
>> +"the filesystem may not be mountable, sectorsize %u doesn't match page
>> size %u\n",
>>                         sectorsize, page_size);
>>         return 0;
>> }
>>
>
>
>> - Xfstest/btrfs group will crash at btrfs/143
>>   Still investigating, but you can ignore btrfs group for now.
>>
>> - Very rare hang
>>   There is a very low change to hang, with "bad ordered accounting"
>>   dmesg.
>>   If you can hit, please let me know.
>>   I had something idea to fix it, but not yet in the branch.
>>
>> - btrfs inode nbytes mismatch
>>   Investigating, as it will make btrfs-check to report error.
>>
>> The last two bugs are the final show blocker, I'll give you extra
>> updates when those are fixed.
>
>   I am running the tests on aarch64 here. Are fixes for these known
>   issues posted in the ML? I can't see them yet.

Not yet, even in my subpage branch.

The problem lies entirely in btrfs_invalidatepage() racing against
writepage endio.

The current problem is that we're using the page Private2 bit to indicate
whether there is any pending ordered io to be finished.

But for the subpage case, a single Private2 bit per page is no longer
sufficient.

The following case can happen:

	T1			|		T2
--------------------------------+---------------------------
Page [0, 16K) dirtied		|
Page [0, 16K) delalloc start	|
|- New ordered extent created	|
|- With PagePrivate2 set	|
				|
[0, 16K) write page endio	|
|- Clear PagePrivate2		|
|- OE [0, 16K) IO_DONE		|
|- Queue finish_ordered_io()	|
    But OE [0, 16K) still in tree|
				|
Page [16K, 32K) dirtied		|
Page [16K, 32K) delalloc start	|
|- New ordered extent created	|
|- With PagePrivate2 set	|
				| invalidatepage on [0, 64K)
				| |- TestClearPagePrivate2
				| |- Dec OE on range [0, 16k)
				| |  |- Underflow OE [0, 16K) <<<
				| |- Dec OE on range [16K, 32K)
				|    |- This is proper dec

In the above case, in invalidatepage(), Ordered Extent [0, 16K) should
not get decreased, as its endio has already finished.

Normally we rely on PagePrivate2 to prevent such a problem, but the
current subpage code has no bitmap for it, which causes the problem.

The invalidatepage() part is also responsible for the inode nbytes mismatch.
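
The sequence above can be replayed in a small userspace sketch. This is
only a toy model under stated assumptions (4K sectors, two 16K ordered
extents in one 64K page); struct toy_oe and the replay_*() helpers are
hypothetical names and do not correspond to kernel code. The first
replay uses one page-wide flag, the second uses one bit per sector.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SZ 4096u
#define SZ_16K    16384u

/* Toy ordered extent: tracks only the bytes whose IO has not finished. */
struct toy_oe {
	uint32_t start;
	int64_t bytes_left;  /* signed, so an underflow shows up as negative */
};

struct replay_result {
	int64_t oe_a_left;  /* OE [0, 16K) */
	int64_t oe_b_left;  /* OE [16K, 32K) */
};

/* Count the bits set for the sectors of [start, start + len). */
static uint32_t sectors_set(uint16_t bitmap, uint32_t start, uint32_t len)
{
	uint32_t count = 0;

	assert(start % SECTOR_SZ == 0 && len % SECTOR_SZ == 0);
	for (uint32_t s = start / SECTOR_SZ; s < (start + len) / SECTOR_SZ; s++)
		count += (bitmap >> s) & 1u;
	return count;
}

/* Replay with a single page-wide flag, like one PagePrivate2 bit. */
static struct replay_result replay_page_flag(void)
{
	struct toy_oe a = { 0, SZ_16K }, b = { SZ_16K, SZ_16K };
	int flag;

	flag = 1;                 /* delalloc start for [0, 16K) */
	flag = 0;                 /* endio for [0, 16K) clears the flag */
	a.bytes_left -= SZ_16K;   /* OE A is IO_DONE but still in the tree */
	flag = 1;                 /* delalloc start for [16K, 32K) */

	/* invalidatepage on [0, 64K): the flag says "something is pending". */
	if (flag) {
		a.bytes_left -= SZ_16K;  /* wrong second decrement: underflow */
		b.bytes_left -= SZ_16K;  /* proper decrement */
	}
	return (struct replay_result){ a.bytes_left, b.bytes_left };
}

/* Replay with one bit per sector. */
static struct replay_result replay_bitmap(void)
{
	struct toy_oe a = { 0, SZ_16K }, b = { SZ_16K, SZ_16K };
	uint16_t bitmap = 0;

	bitmap |= 0x000Fu;             /* delalloc start for [0, 16K) */
	bitmap &= (uint16_t)~0x000Fu;  /* endio for [0, 16K) */
	a.bytes_left -= SZ_16K;
	bitmap |= 0x00F0u;             /* delalloc start for [16K, 32K) */

	/* invalidatepage: only decrement for sectors that are still set. */
	a.bytes_left -= (int64_t)sectors_set(bitmap, a.start, SZ_16K) * SECTOR_SZ;
	b.bytes_left -= (int64_t)sectors_set(bitmap, b.start, SZ_16K) * SECTOR_SZ;
	return (struct replay_result){ a.bytes_left, b.bytes_left };
}
```

In this model the page-wide flag leaves OE [0, 16K) at -16384 bytes,
while the per-sector bitmap leaves both ordered extents at 0.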

IMHO, the btrfs_dec_test_.*ordered_pending() API is pure garbage.
It requires callers to handle the Private2 bit and do the loop, but that
should be completely integrated into the ordered extent code, without
exposing those details to callers.

I'm currently reworking the involved APIs.
btrfs_dec_test_first_ordered_pending() has been converted to a subpage
friendly one and passes tests on 4K page systems.

But the btrfs_dec_test_ordered_pending() call in btrfs_invalidatepage()
is much harder to handle.
I will keep working on the problem in the coming days to solve it
completely, then rebase all the subpage code on the refactored ordered
extent code.

Thanks,
Qu
>
> Thanks, Anand
>
>
>> Thanks,
>> Qu
>>
>>>
>>> -ritesh
>>>
>>>
>

Ritesh Harjani (IBM) April 2, 2021, 8:33 a.m. UTC | #11

On 21/03/29 10:01AM, Qu Wenruo wrote:
>
>
> On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/3/25 下午8:20, Neal Gompa wrote:
> > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > > > >
> > > > > This patchset can be fetched from the following github repo, along with
> > > > > the full subpage RW support:
> > > > > https://github.com/adam900710/linux/tree/subpage
> > > > >
> > > > > This patchset is for metadata read write support.
> > > > >
> > > > > [FULL RW TEST]
> > > > > Since the data write path is not included in this patchset, we can't
> > > > > really test the patchset itself, but anyone can grab the patch from
> > > > > github repo and do fstests/generic tests.
> > > > >
> > > > > But at least the full RW patchset can pass -g generic/quick -x defrag
> > > > > for now.
> > > > >
> > > > > There are some known issues:
> > > > >
> > > > > - Defrag behavior change
> > > > >     Since current defrag is doing per-page defrag, to support subpage
> > > > >     defrag, we need some change in the loop.
> > > > >     E.g. if a page has both hole and regular extents in it, then defrag
> > > > >     will rewrite the full 64K page.
> > > > >
> > > > >     Thus for now, defrag related failure is expected.
> > > > >     But this should only cause behavior difference, no crash nor hang is
> > > > >     expected.
> > > > >
> > > > > - No compression support yet
> > > > >     There are at least 2 known bugs if forcing compression for subpage
> > > > >     * Some hard coded PAGE_SIZE screwing up space rsv
> > > > >     * Subpage ASSERT() triggered
> > > > >       This is because some compression code is unlocking locked_page by
> > > > >       calling extent_clear_unlock_delalloc() with locked_page == NULL.
> > > > >     So for now compression is also disabled.
> > > > >
> > > > > - Inode nbytes mismatch
> > > > >     Still debugging.
> > > > >     The fastest way to trigger is fsx using the following parameters:
> > > > >
> > > > >       fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
> > > > >
> > > > >     Which would cause inode nbytes differs from expected value and
> > > > >     triggers btrfs check error.
> > > > >
> > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > The metadata part in fact has more new code than data part, as it has
> > > > > some different behaviors compared to the regular sector size handling:
> > > > >
> > > > > - No more page locking
> > > > >     Now metadata read/write relies on extent io tree locking, other than
> > > > >     page locking.
> > > > >     This is to allow behaviors like read lock one eb while also try to
> > > > >     read lock another eb in the same page.
> > > > >     We can't rely on page lock as now we have multiple extent buffers in
> > > > >     the same page.
> > > > >
> > > > > - Page status update
> > > > >     Now we use subpage wrappers to handle page status update.
> > > > >
> > > > > - How to submit dirty extent buffers
> > > > >     Instead of just grabbing extent buffer from page::private, we need to
> > > > >     iterate all dirty extent buffers in the page and submit them.
> > > > >
> > > > > [CHANGELOG]
> > > > > v2:
> > > > > - Rebased to latest misc-next
> > > > >     No conflicts at all.
> > > > >
> > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > >     This will allow mkfs.btrfs to detect unmountable fs better.
> > > > >
> > > > > - Use newer naming schema for each patch
> > > > >     No more "extent_io:" or "inode:" schema anymore.
> > > > >
> > > > > - Move two pure cleanups to the series
> > > > >     Patch 2~3, originally in RW part.
> > > > >
> > > > > - Fix one uninitialized variable
> > > > >     Patch 6.
> > > > >
> > > > > v3:
> > > > > - Rename the sysfs to supported_sectorsizes
> > > > >
> > > > > - Rebased to latest misc-next branch
> > > > >     This removes 2 cleanup patches.
> > > > >
> > > > > - Add new overview comment for subpage metadata
> > > > >
> > > > > Qu Wenruo (13):
> > > > >     btrfs: add sysfs interface for supported sectorsize
> > > > >     btrfs: use min() to replace open-code in btrfs_invalidatepage()
> > > > >     btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> > > > >     btrfs: refactor how we iterate ordered extent in
> > > > >       btrfs_invalidatepage()
> > > > >     btrfs: introduce helpers for subpage dirty status
> > > > >     btrfs: introduce helpers for subpage writeback status
> > > > >     btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> > > > >       metadata
> > > > >     btrfs: support subpage metadata csum calculation at write time
> > > > >     btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > > > >     btrfs: make the page uptodate assert to be subpage compatible
> > > > >     btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> > > > >     btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> > > > >       compatible
> > > > >     btrfs: add subpage overview comments
> > > > >
> > > > >    fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
> > > > >    fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
> > > > >    fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
> > > > >    fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
> > > > >    fs/btrfs/subpage.h   |  17 +++++
> > > > >    fs/btrfs/sysfs.c     |  15 +++++
> > > > >    6 files changed, 441 insertions(+), 116 deletions(-)
> > > > >
> > > > > --
> > > > > 2.30.1
> > > > >
> > > >
> > > > Why wouldn't we just integrate full read-write support with the
> > > > caveats as described now? It seems to be relatively reasonable to do
> > > > that, and this patch set is essentially unusable without the rest of
> > > > it that does enable full read-write support.
> > >
> > > The metadata part is much more stable than data path (almost not touched
> > > for several months), and the metadata part already has some difference
> > > in its behavior, which needs review.
> > >
> > > Your point makes some sense, but I still don't believe that pushing a
> > > super large patchset helps the review.
> > >
> > > If you want to test, you can grab the branch from the github repo.
> > > If you want to review, the mails are all here for review.
> > >
> > > In fact, we used to have subpage support sent as a big patchset from IBM
> > > guys, but the result is only some preparation patches get merged, and
> > > nothing more.
> > >
> > > Using this multi-series method, we're already doing better work and
> > > received more testing (to ensure regular sectorsize is not affected at
> > > least).
> >
> > Hi Qu Wenruo,
> >
> > Sorry about chiming in late on this. I don't have any strong objection on either
> > approach. Although sometime back when I tested your RW support git tree on
> > Power, the unmount patch itself was crashing. I didn't debug it that time
> > (this was a month back or so), so I also didn't bother testing xfstests on Power.
> >
> > But we do have an interest in making sure this patch series works on bs < ps
> > on the Power platform. I can try helping with testing, reviewing (to the best
> > of my knowledge), and fixing anything where possible :)
>
> That's great!
>
> One of my biggest problems here is that I don't have a good enough testing
> environment.
>
> Although SUSE has internal clouds for ARM64/PPC64, due to the
> f**king Great Firewall they are super slow to access, not to mention doing
> proper debugging.
>
> Currently I'm using two ARM SBCs, RK3399 and A311D based, for testing.
> But their computing power is far from ideal; only generic/quick can
> finish within hours.
>
> Thus real-world Power machines could definitely help.
> >
> > Let me try and pull your tree and test it on Power. Please let me know if there
> > is anything that needs to be taken care of apart from your github tree and the
> > btrfs-progs branch with bs < ps support.
>
> If you're going to test the branch, here are some small notes:
>
> - Need to use latest btrfs-progs
>   As it fixes a false alert on crossing 64K page boundary.
>
> - Need to slightly modify btrfs-progs to avoid false alerts
>   In the subpage case, mkfs.btrfs prints a warning, but that warning
>   goes to stderr, which will screw up the generic test groups.
>   It's recommended to apply the following diff:
>
> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> index 569208a9..21976554 100644
> --- a/common/fsfeatures.c
> +++ b/common/fsfeatures.c
> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>                 return -EINVAL;
>         }
>         if (page_size != sectorsize)
> -               warning(
> -"the filesystem may not be mountable, sectorsize %u doesn't match page size %u",
> +               printf(
> +"the filesystem may not be mountable, sectorsize %u doesn't match page size %u\n",
>                         sectorsize, page_size);
>         return 0;
>  }
>
> - Xfstest/btrfs group will crash at btrfs/143
>   Still investigating, but you can ignore btrfs group for now.
>
> - Very rare hang
>   There is a very low chance of a hang, with a "bad ordered accounting"
>   message in dmesg.
>   If you hit it, please let me know.
>   I have some ideas for fixing it, but they are not in the branch yet.
>
> - btrfs inode nbytes mismatch
>   Investigating, as it makes btrfs check report an error.
>
> The last two bugs are the final showstoppers; I'll give you extra
> updates when they are fixed.

Thanks, Qu Wenruo, for the above info.
I cloned the git tree below, as mentioned in your git log, to test RW on Power.
However, I still see that RW mount for bs < ps is disabled in open_ctree():
https://github.com/adam900710/linux/tree/subpage

I see the code below in this tree:
         /* For 4K sector size support, it's only read-only */
         if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
                 if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
                         btrfs_err(fs_info,
         "subpage sectorsize %u only supported read-only for page size %lu",
                                 sectorsize, PAGE_SIZE);
                         err = -EINVAL;
                         goto fail_alloc;
                 }
         }

Could you please point me to the tree I can use for bs < ps testing on Power?
Sorry if I missed something.

Thanks
-ritesh
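
For readers following along, the gate quoted in the message above can be modeled in userspace as a minimal sketch. The helper name subpage_mount_check() and its boolean parameters are hypothetical and are not kernel API; the sketch only mirrors the condition in the quoted snippet, on the assumption that a non-zero log root means a log tree is present and would need to be replayed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K  4096u
#define SZ_64K 65536u

/*
 * Userspace model of the quoted open_ctree() check (hypothetical helper,
 * not a kernel function).
 *
 * page_size    - system page size in bytes
 * sectorsize   - filesystem sector size in bytes
 * rdonly       - true if the mount is read-only (stands in for sb_rdonly())
 * has_log_root - true if the superblock records a log root (stands in for
 *                a non-zero btrfs_super_log_root())
 *
 * Returns 0 if the mount may proceed, -EINVAL if it is rejected.
 */
static int subpage_mount_check(uint32_t page_size, uint32_t sectorsize,
                               bool rdonly, bool has_log_root)
{
        /* Only the 64K page / 4K sector combination is gated here. */
        if (page_size == SZ_64K && sectorsize == SZ_4K) {
                /* Rejected unless read-only and without a log root. */
                if (!rdonly || has_log_root)
                        return -EINVAL;
        }
        return 0;
}
```

Under this model, a read-write mount of a 4K-sectorsize filesystem on a 64K-page system is rejected with -EINVAL, which matches the behavior reported above for the current state of the branch.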

Qu Wenruo April 2, 2021, 8:36 a.m. UTC | #12

On 2021/4/2 4:33 PM, Ritesh Harjani wrote:
> Thanks, Qu Wenruo, for the above info.
> I cloned the git tree below, as mentioned in your git log, to test RW on Power.
> However, I still see that RW mount for bs < ps is disabled in open_ctree():
> https://github.com/adam900710/linux/tree/subpage
>
> I see the code below in this tree:
>           /* For 4K sector size support, it's only read-only */
>           if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>                   if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
>                           btrfs_err(fs_info,
>           "subpage sectorsize %u only supported read-only for page size %lu",
>                                   sectorsize, PAGE_SIZE);
>                           err = -EINVAL;
>                           goto fail_alloc;
>                   }
>           }
>
> Could you please point me to the tree I can use for bs < ps testing on Power?
> Sorry if I missed something.

Sorry, I updated the branch to my current development progress. It is now
at the ordered extent rework part, without any of the remaining subpage
functionality.

You may want to grab this tree instead:
https://github.com/adam900710/linux/tree/subpage_old

But please keep in mind that you may hit random hangs, and certain
generic test cases, especially generic/075, can corrupt the inode nbytes,
causing all later test cases that use TEST_DEV to report errors on fsck.

Thanks,
Qu

>
> Thanks
> -ritesh
>

Ritesh Harjani (IBM) April 2, 2021, 8:46 a.m. UTC | #13

On 21/04/02 04:36PM, Qu Wenruo wrote:
> Sorry, I updated the branch to my current development progress. It is now
> at the ordered extent rework part, without any of the remaining subpage
> functionality.
>
> You may want to grab this tree instead:
> https://github.com/adam900710/linux/tree/subpage_old
>
> But please keep in mind that you may hit random hangs, and certain
> generic test cases, especially generic/075, can corrupt the inode nbytes,
> causing all later test cases that use TEST_DEV to report errors on fsck.
>

Thanks for the quick response. Sure, I will exclude generic/075 from the test
run for now.

-ritesh

Qu Wenruo April 2, 2021, 8:52 a.m. UTC | #14

On 2021/4/2 4:46 PM, Ritesh Harjani wrote:
> On 21/04/02 04:36PM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/2 4:33 PM, Ritesh Harjani wrote:
>>> On 21/03/29 10:01AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/3/29 4:02 AM, Ritesh Harjani wrote:
>>>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/3/25 8:20 PM, Neal Gompa wrote:
>>>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>>>
>>>>>>>> This patchset can be fetched from the following github repo, along with
>>>>>>>> the full subpage RW support:
>>>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>>>
>>>>>>>> This patchset is for metadata read write support.
>>>>>>>>
>>>>>>>> [FULL RW TEST]
>>>>>>>> Since the data write path is not included in this patchset, we can't
>>>>>>>> really test the patchset itself, but anyone can grab the patch from
>>>>>>>> github repo and do fstests/generic tests.
>>>>>>>>
>>>>>>>> But at least the full RW patchset can pass -g generic/quick -x defrag
>>>>>>>> for now.
>>>>>>>>
>>>>>>>> There are some known issues:
>>>>>>>>
>>>>>>>> - Defrag behavior change
>>>>>>>>       Since current defrag is doing per-page defrag, to support subpage
>>>>>>>>       defrag, we need some change in the loop.
>>>>>>>>       E.g. if a page has both hole and regular extents in it, then defrag
>>>>>>>>       will rewrite the full 64K page.
>>>>>>>>
>>>>>>>>       Thus for now, defrag related failure is expected.
>>>>>>>>       But this should only cause behavior difference, no crash nor hang is
>>>>>>>>       expected.
>>>>>>>>
>>>>>>>> - No compression support yet
>>>>>>>>       There are at least 2 known bugs if forcing compression for subpage
>>>>>>>>       * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>>>       * Subpage ASSERT() triggered
>>>>>>>>         This is because some compression code is unlocking locked_page by
>>>>>>>>         calling extent_clear_unlock_delalloc() with locked_page == NULL.
>>>>>>>>       So for now compression is also disabled.
>>>>>>>>
>>>>>>>> - Inode nbytes mismatch
>>>>>>>>       Still debugging.
>>>>>>>>       The fastest way to trigger is fsx using the following parameters:
>>>>>>>>
>>>>>>>>         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx
>>>>>>>>
>>>>>>>>       Which would cause inode nbytes differs from expected value and
>>>>>>>>       triggers btrfs check error.
>>>>>>>>
>>>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>>>> The metadata part in fact has more new code than data part, as it has
>>>>>>>> some different behaviors compared to the regular sector size handling:
>>>>>>>>
>>>>>>>> - No more page locking
>>>>>>>>       Now metadata read/write relies on extent io tree locking, other than
>>>>>>>>       page locking.
>>>>>>>>       This is to allow behaviors like read lock one eb while also try to
>>>>>>>>       read lock another eb in the same page.
>>>>>>>>       We can't rely on page lock as now we have multiple extent buffers in
>>>>>>>>       the same page.
>>>>>>>>
>>>>>>>> - Page status update
>>>>>>>>       Now we use subpage wrappers to handle page status update.
>>>>>>>>
>>>>>>>> - How to submit dirty extent buffers
>>>>>>>>       Instead of just grabbing extent buffer from page::private, we need to
>>>>>>>>       iterate all dirty extent buffers in the page and submit them.
>>>>>>>>
>>>>>>>> [CHANGELOG]
>>>>>>>> v2:
>>>>>>>> - Rebased to latest misc-next
>>>>>>>>       No conflicts at all.
>>>>>>>>
>>>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>>>       This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>>>
>>>>>>>> - Use newer naming schema for each patch
>>>>>>>>       No more "extent_io:" or "inode:" schema anymore.
>>>>>>>>
>>>>>>>> - Move two pure cleanups to the series
>>>>>>>>       Patch 2~3, originally in RW part.
>>>>>>>>
>>>>>>>> - Fix one uninitialized variable
>>>>>>>>       Patch 6.
>>>>>>>>
>>>>>>>> v3:
>>>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>>>
>>>>>>>> - Rebased to latest misc-next branch
>>>>>>>>       This removes 2 cleanup patches.
>>>>>>>>
>>>>>>>> - Add new overview comment for subpage metadata
>>>>>>>>
>>>>>>>> Qu Wenruo (13):
>>>>>>>>       btrfs: add sysfs interface for supported sectorsize
>>>>>>>>       btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>>>>>       btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>>>>>>>       btrfs: refactor how we iterate ordered extent in
>>>>>>>>         btrfs_invalidatepage()
>>>>>>>>       btrfs: introduce helpers for subpage dirty status
>>>>>>>>       btrfs: introduce helpers for subpage writeback status
>>>>>>>>       btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>>>>>>>         metadata
>>>>>>>>       btrfs: support subpage metadata csum calculation at write time
>>>>>>>>       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>>>       btrfs: make the page uptodate assert to be subpage compatible
>>>>>>>>       btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>>>>>       btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>>>>>         compatible
>>>>>>>>       btrfs: add subpage overview comments
>>>>>>>>
>>>>>>>>      fs/btrfs/disk-io.c   | 143 ++++++++++++++++++++++++++++++++++---------
>>>>>>>>      fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
>>>>>>>>      fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
>>>>>>>>      fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
>>>>>>>>      fs/btrfs/subpage.h   |  17 +++++
>>>>>>>>      fs/btrfs/sysfs.c     |  15 +++++
>>>>>>>>      6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.30.1
>>>>>>>>
>>>>>>>
>>>>>>> Why wouldn't we just integrate full read-write support with the
>>>>>>> caveats as described now? It seems to be relatively reasonable to do
>>>>>>> that, and this patch set is essentially unusable without the rest of
>>>>>>> it that does enable full read-write support.
>>>>>>
>>>>>> The metadata part is much more stable than data path (almost not touched
>>>>>> for several months), and the metadata part already has some difference
>>>>>> in its behavior, which needs review.
>>>>>>
>>>>>> Your point makes some sense, but I still don't believe pushing a super
>>>>>> large patchset helps the review at all.
>>>>>>
>>>>>> If you want to test, you can grab the branch from the github repo.
>>>>>> If you want to review, the mails are all here for review.
>>>>>>
>>>>>> In fact, we used to have subpage support sent as a big patchset from IBM
>>>>>> guys, but the result is only some preparation patches get merged, and
>>>>>> nothing more.
>>>>>>
>>>>>> Using this multi-series method, we're already doing better work and
>>>>>> received more testing (to ensure regular sectorsize is not affected at
>>>>>> least).
>>>>>
>>>>> Hi Qu Wenruo,
>>>>>
>>>>> Sorry about chiming in late on this. I don't have any strong objection on either
>>>>> approach. Although sometime back when I tested your RW support git tree on
>>>>> Power, the unmount patch itself was crashing. I didn't debug it that time
>>>>> (this was a month back or so), so I also didn't bother testing xfstests on Power.
>>>>>
>>>>> But we do have an interest in making sure this patch series work on bs < ps
>>>>> on Power platform. I can try helping with testing, reviewing (to best of my
>>>>> knowledge) and fixing anything is possible :)
>>>>
>>>> That's great!
>>>>
>>>> One of my biggest problem here is, I don't have good enough testing
>>>> environment.
>>>>
>>>> Although SUSE has internal clouds for ARM64/PPC64, due to the
>>>> f**king Great Firewall, it's super slow to access, not to mention doing
>>>> proper debugging.
>>>>
>>>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
>>>> But their computing power is far from ideal, only generic/quick can
>>>> finish in hours.
>>>>
>>>> Thus real world Power could definitely help.
>>>>>
>>>>> Let me try and pull your tree and test it on Power. Please let me know if there
>>>>> is anything needs to be taken care apart from your github tree and btrfs-progs
>>>>> branch with bs < ps support.
>>>>
>>>> If you're going to test the branch, here are some small notes:
>>>>
>>>> - Need to use latest btrfs-progs
>>>>     As it fixes a false alert on crossing 64K page boundary.
>>>>
>>>> - Need to slightly modify btrfs-progs to avoid false alerts
>>>>     For subpage case, mkfs.btrfs will output a warning, but that warning
>>>>     is outputted into stderr, which will screw up generic test groups.
>>>>     It's recommended to apply the following diff:
>>>>
>>>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>>>> index 569208a9..21976554 100644
>>>> --- a/common/fsfeatures.c
>>>> +++ b/common/fsfeatures.c
>>>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>>>                   return -EINVAL;
>>>>           }
>>>>           if (page_size != sectorsize)
>>>> -               warning(
>>>> -"the filesystem may not be mountable, sectorsize %u doesn't match page size %u",
>>>> +               printf(
>>>> +"the filesystem may not be mountable, sectorsize %u doesn't match page size %u\n",
>>>>                           sectorsize, page_size);
>>>>           return 0;
>>>>    }
>>>>
>>>> - Xfstest/btrfs group will crash at btrfs/143
>>>>     Still investigating, but you can ignore btrfs group for now.
>>>>
>>>> - Very rare hang
>>>>     There is a very low chance to hang, with "bad ordered accounting"
>>>>     dmesg.
>>>>     If you can hit it, please let me know.
>>>>     I have an idea to fix it, but it's not yet in the branch.
>>>>
>>>> - btrfs inode nbytes mismatch
>>>>     Investigating, as it will make btrfs-check to report error.
>>>>
>>>> The last two bugs are the final show blocker, I'll give you extra
>>>> updates when those are fixed.
>>>
>>> Thanks Qu Wenruo, for above info.
>>> I cloned below git tree as mentioned in your git log to test for RW on Power.
>>> However, I still see that RW mount for bs < ps is disabled in open_ctree()
>>> https://github.com/adam900710/linux/tree/subpage
>>>
>>> I see below code present in this tree.
>>>            /* For 4K sector size support, it's only read-only */
>>>            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>>>                    if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
>>>                            btrfs_err(fs_info,
>>>            "subpage sectorsize %u only supported read-only for page size %lu",
>>>                                    sectorsize, PAGE_SIZE);
>>>                            err = -EINVAL;
>>>                            goto fail_alloc;
>>>                    }
>>>            }
>>>
>>> Could you pls point me to the tree I can use for bs < ps testing on Power?
>>> Sorry if I missed something.
>>
>> Sorry, I updated the branch to my current development progress, it's now
>> at the ordered extent rework part, without the remaining subpage
>> functionality at all.
>>
>> You may want to grab this tree instead:
>> https://github.com/adam900710/linux/tree/subpage_old
>>
>> But please keep in mind that, you may get random hang, and certain
>> generic test case, especially generic/075 can corrupt the inode nbytes
>> and leaving all later test cases using TEST_DEV to report error on fsck.
>>
>
> Thanks for quick response. Sure, I will exclude generic/075 from the test
> for now.

Not only generic/075, but all tests running fsx may cause inode nbytes
corruption.

Thus I'd recommend either modifying btrfs-check to ignore it, or re-running
mkfs on TEST_DEV after each test case.

Thanks,
Qu

>
> -ritesh
>

David Sterba April 3, 2021, 11:08 a.m. UTC | #15

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from the following github repo, along with
> the full subpage RW support:
> https://github.com/adam900710/linux/tree/subpage
> 
> This patchset is for metadata read write support.

> Qu Wenruo (13):
>   btrfs: add sysfs interface for supported sectorsize
>   btrfs: use min() to replace open-code in btrfs_invalidatepage()
>   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>   btrfs: refactor how we iterate ordered extent in
>     btrfs_invalidatepage()
>   btrfs: introduce helpers for subpage dirty status
>   btrfs: introduce helpers for subpage writeback status
>   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>     metadata
>   btrfs: support subpage metadata csum calculation at write time
>   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>   btrfs: make the page uptodate assert to be subpage compatible
>   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>     compatible
>   btrfs: add subpage overview comments

Moved from topic branch to misc-next.

Qu Wenruo April 5, 2021, 6:14 a.m. UTC | #16

On 2021/4/3 7:08 PM, David Sterba wrote:
> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from the following github repo, along with
>> the full subpage RW support:
>> https://github.com/adam900710/linux/tree/subpage
>>
>> This patchset is for metadata read write support.
>
>> Qu Wenruo (13):
>>    btrfs: add sysfs interface for supported sectorsize
>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>>    btrfs: refactor how we iterate ordered extent in
>>      btrfs_invalidatepage()
>>    btrfs: introduce helpers for subpage dirty status
>>    btrfs: introduce helpers for subpage writeback status
>>    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
>>      metadata
>>    btrfs: support subpage metadata csum calculation at write time
>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>    btrfs: make the page uptodate assert to be subpage compatible
>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>      compatible
>>    btrfs: add subpage overview comments
>
> Moved from topic branch to misc-next.
>

Not sure if it's too late, but I inserted the last comment patch into
the wrong location.

In fact, there are 4 more patches to make subpage metadata RW really work:
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function

Those 4 patches should be before the final comment patch.
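
A minimal sketch of the idea behind submitting a subpage metadata page,
assuming a 64K page, 4K sectorsize and 16K nodesize: instead of taking a
single extent buffer from page::private, walk a per-page dirty bitmap (one
bit per sector) and pick out every dirty extent buffer. The names below
(collect_dirty_ebs() and the SKETCH_* constants) are hypothetical userspace
stand-ins for illustration, not the code in the 4 patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry for this sketch only. */
#define SKETCH_PAGE_SIZE        65536u
#define SKETCH_SECTORSIZE       4096u
#define SKETCH_NODESIZE         16384u
#define SKETCH_SECTORS_PER_PAGE (SKETCH_PAGE_SIZE / SKETCH_SECTORSIZE) /* 16 */
#define SKETCH_SECTORS_PER_NODE (SKETCH_NODESIZE / SKETCH_SECTORSIZE)  /* 4 */

/*
 * Scan a 16-bit dirty bitmap (bit N set == sector N of the page is dirty)
 * and record the page offset of each dirty extent buffer.
 *
 * Assumption: the first dirty bit found marks the start of an extent
 * buffer, which then covers SKETCH_SECTORS_PER_NODE sectors.
 *
 * Returns the number of offsets written, at most @max.
 */
static size_t collect_dirty_ebs(uint16_t dirty_bitmap, uint32_t *offsets,
				size_t max)
{
	size_t found = 0;
	unsigned int bit = 0;

	assert(offsets != NULL);

	while (bit < SKETCH_SECTORS_PER_PAGE && found < max) {
		if (!(dirty_bitmap & (1u << bit))) {
			bit++;
			continue;
		}
		/* A real implementation would look up and submit the eb here. */
		offsets[found++] = bit * SKETCH_SECTORSIZE;
		bit += SKETCH_SECTORS_PER_NODE;
	}
	return found;
}
```

With a bitmap of 0xF00F this reports two extent buffers, at page offsets 0
and 49152; a clean page (bitmap 0) reports none.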

Should I just send the 4 patches in a separate series?

Sorry for the bad split, it looks like multi-series patches indeed have
such problems...

Thanks,
Qu

Anand Jain April 6, 2021, 2:31 a.m. UTC | #17

On 05/04/2021 14:14, Qu Wenruo wrote:
> 
> 
> On 2021/4/3 7:08 PM, David Sterba wrote:
>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from the following github repo, along with
>>> the full subpage RW support:
>>> https://github.com/adam900710/linux/tree/subpage
>>>
>>> This patchset is for metadata read write support.
>>
>>> Qu Wenruo (13):
>>>    btrfs: add sysfs interface for supported sectorsize
>>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>    btrfs: remove unnecessary variable shadowing in 
>>> btrfs_invalidatepage()
>>>    btrfs: refactor how we iterate ordered extent in
>>>      btrfs_invalidatepage()
>>>    btrfs: introduce helpers for subpage dirty status
>>>    btrfs: introduce helpers for subpage writeback status
>>>    btrfs: allow btree_set_page_dirty() to do more sanity check on 
>>> subpage
>>>      metadata
>>>    btrfs: support subpage metadata csum calculation at write time
>>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>    btrfs: make the page uptodate assert to be subpage compatible
>>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>      compatible
>>>    btrfs: add subpage overview comments
>>
>> Moved from topic branch to misc-next.
>>
> 
> Note sure if it's too late, but I inserted the last comment patch into
> the wrong location.
> 
> In fact, there are 4 more patches to make


> subpage metadata RW really work:

  I took some time to go through these patches, which are lined up for
  integration.

  With this set of patches that are being integrated, we don't yet
  support RW mount of a filesystem if PAGESIZE > sectorsize as a whole.
  How is subpage metadata RW support to be used in production?
  Or how is this supposed to be tested?

  Or should you just clean up the title to say these are preparatory
  patches to support subpage RW? It is confusing.
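
  For reference, the restriction described here (no RW mount when the page
  size is larger than the sectorsize) matches the open_ctree() check quoted
  earlier in this thread. A minimal userspace sketch of that check follows,
  assuming only the 64K page / 4K sectorsize combination is special-cased;
  sketch_check_subpage_mount() and the SKETCH_* constants are hypothetical
  names, and the power-of-two assert is an added sanity check, not part of
  the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_4K  4096u
#define SKETCH_SZ_64K 65536u

/*
 * Mirror of the quoted mount-time check: for a 4K sectorsize on a 64K
 * page system, only a read-only mount without a log tree is accepted.
 *
 * Returns 0 if the mount may proceed, -EINVAL otherwise.
 */
static int sketch_check_subpage_mount(uint32_t page_size, uint32_t sectorsize,
				      bool rdonly, bool has_log_root)
{
	/* Sector sizes are expected to be non-zero powers of two. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	if (page_size == SKETCH_SZ_64K && sectorsize == SKETCH_SZ_4K) {
		if (!rdonly || has_log_root)
			return -EINVAL;
	}
	return 0;
}
```

  Under this check a RW mount of a 4K sectorsize filesystem on a 64K page
  system is rejected, while the same filesystem mounts on a 4K page system.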

Thanks, Anand


> btrfs: make lock_extent_buffer_for_io() to be subpage compatible
> btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
> btrfs: introduce end_bio_subpage_eb_writepage() function
> btrfs: introduce write_one_subpage_eb() function
> 
> Those 4 patches should be before the final comment patch.
> 
> Should I just send the 4 patches in a separate series?
> 
> Sorry for the bad split, it looks like multi-series patches indeed has
> such problem...
> 
> Thanks,
> Qu

David Sterba April 6, 2021, 7:13 p.m. UTC | #18

On Mon, Apr 05, 2021 at 02:14:34PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/4/3 7:08 PM, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> This patchset can be fetched from the following github repo, along with
> >> the full subpage RW support:
> >> https://github.com/adam900710/linux/tree/subpage
> >>
> >> This patchset is for metadata read write support.
> >
> >> Qu Wenruo (13):
> >>    btrfs: add sysfs interface for supported sectorsize
> >>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> >>    btrfs: refactor how we iterate ordered extent in
> >>      btrfs_invalidatepage()
> >>    btrfs: introduce helpers for subpage dirty status
> >>    btrfs: introduce helpers for subpage writeback status
> >>    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> >>      metadata
> >>    btrfs: support subpage metadata csum calculation at write time
> >>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>    btrfs: make the page uptodate assert to be subpage compatible
> >>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>      compatible
> >>    btrfs: add subpage overview comments
> >
> > Moved from topic branch to misc-next.
> >
> 
> Note sure if it's too late, but I inserted the last comment patch into
> the wrong location.

Not late yet but getting very close to the pre-merge window code freeze.

> In fact, there are 4 more patches to make subpage metadata RW really work:
>   btrfs: make lock_extent_buffer_for_io() to be subpage compatible
>   btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
>   btrfs: introduce end_bio_subpage_eb_writepage() function
>   btrfs: introduce write_one_subpage_eb() function
> 
> Those 4 patches should be before the final comment patch.
> 
> Should I just send the 4 patches in a separate series?

As they've been posted now, I'll add them to for-next and reorder before
the last patch with comment, after some testing.

> Sorry for the bad split, it looks like multi-series patches indeed has
> such problem...

Yeah, but so far it's been all fixable given the scope of the whole
subpage support.

David Sterba April 6, 2021, 7:20 p.m. UTC | #19

On Tue, Apr 06, 2021 at 10:31:58AM +0800, Anand Jain wrote:
> On 05/04/2021 14:14, Qu Wenruo wrote:
> > 
> > 
> > On 2021/4/3 7:08 PM, David Sterba wrote:
> >> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from the following github repo, along with
> >>> the full subpage RW support:
> >>> https://github.com/adam900710/linux/tree/subpage
> >>>
> >>> This patchset is for metadata read write support.
> >>
> >>> Qu Wenruo (13):
> >>>    btrfs: add sysfs interface for supported sectorsize
> >>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>>    btrfs: remove unnecessary variable shadowing in 
> >>> btrfs_invalidatepage()
> >>>    btrfs: refactor how we iterate ordered extent in
> >>>      btrfs_invalidatepage()
> >>>    btrfs: introduce helpers for subpage dirty status
> >>>    btrfs: introduce helpers for subpage writeback status
> >>>    btrfs: allow btree_set_page_dirty() to do more sanity check on 
> >>> subpage
> >>>      metadata
> >>>    btrfs: support subpage metadata csum calculation at write time
> >>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>>    btrfs: make the page uptodate assert to be subpage compatible
> >>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>>      compatible
> >>>    btrfs: add subpage overview comments
> >>
> >> Moved from topic branch to misc-next.
> >>
> > 
> > Note sure if it's too late, but I inserted the last comment patch into
> > the wrong location.
> > 
> > In fact, there are 4 more patches to make
> 
> 
> > subpage metadata RW really work:
> 
>   I took some time to go through these patches, which are lined up for
>   integration.
> 
>   With this set of patches that are being integrated, we don't yet
>   support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
>   Subpage metadata RW support, how is it to be used in the production?
>   OR How is this supposed to be tested?

What gets merged to misc-next is incrementally adding support for the
whole subpage feature. This would be quite hard to get in in one go so it's
been split into patchsets with known limitations. Qu lists what works and
what does not in the cover letter.

With known missing functionality it obviously can't be used for
production, just for testing. There are likely patches in Qu's
development branches and more patches still to be written, so even the
testing is partial with known failures or bugs.

>   OR should you just cleanup the title as preparatory patches to support
>   subpage RW? It is confusing.

I think the title says what the patchset does, adding rw support for
metadata, in a sense it's still preparatory, yes.

Qu Wenruo April 6, 2021, 11:59 p.m. UTC | #20

On 2021/4/6 10:31 AM, Anand Jain wrote:
> On 05/04/2021 14:14, Qu Wenruo wrote:
>>
>>
>> On 2021/4/3 7:08 PM, David Sterba wrote:
>>> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
>>>> This patchset can be fetched from the following github repo, along with
>>>> the full subpage RW support:
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> This patchset is for metadata read write support.
>>>
>>>> Qu Wenruo (13):
>>>>    btrfs: add sysfs interface for supported sectorsize
>>>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
>>>>    btrfs: remove unnecessary variable shadowing in
>>>> btrfs_invalidatepage()
>>>>    btrfs: refactor how we iterate ordered extent in
>>>>      btrfs_invalidatepage()
>>>>    btrfs: introduce helpers for subpage dirty status
>>>>    btrfs: introduce helpers for subpage writeback status
>>>>    btrfs: allow btree_set_page_dirty() to do more sanity check on
>>>> subpage
>>>>      metadata
>>>>    btrfs: support subpage metadata csum calculation at write time
>>>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>    btrfs: make the page uptodate assert to be subpage compatible
>>>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>>>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
>>>>      compatible
>>>>    btrfs: add subpage overview comments
>>>
>>> Moved from topic branch to misc-next.
>>>
>>
>> Note sure if it's too late, but I inserted the last comment patch into
>> the wrong location.
>>
>> In fact, there are 4 more patches to make
>
>
>> subpage metadata RW really work:
>
> I took some time to go through these patches, which are lined up for
> integration.
>
> With this set of patches that are being integrated, we don't yet
> support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
> Subpage metadata RW support, how is it to be used in the production?

I'd say, without the ability to write subpage metadata, how would
subpage even be utilized in production environment?

> OR How is this supposed to be tested?

There are two ways:
- Craft some scripts to only do metadata operations without any data
   writes

- Wait for my data write support then run regular full test suites

I used to go with method 1, but since my local branch already has full
subpage RW support, I'm doing method 2.

Although it exposes quite a few bugs in the data write path, it has been
quite a long time since the last metadata related bug.

>
> OR should you just cleanup the title as preparatory patches to support
> subpage RW? It is confusing.

Well, considering this is the last patchset before full subpage RW, such
"preparatory" mention would be saved for next big function add.
(Thankfully, there is no such plan yet)

Thanks,
Qu

>
> Thanks, Anand
>
>
>> btrfs: make lock_extent_buffer_for_io() to be subpage compatible
>> btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
>> btrfs: introduce end_bio_subpage_eb_writepage() function
>> btrfs: introduce write_one_subpage_eb() function
>>
>> Those 4 patches should be before the final comment patch.
>>
>> Should I just send the 4 patches in a separate series?
>>
>> Sorry for the bad split, it looks like multi-series patches indeed has
>> such problem...
>>
>> Thanks,
>> Qu
>

Qu Wenruo April 12, 2021, 11:33 a.m. UTC | #21

On 2021/4/2 4:52 PM, Qu Wenruo wrote:
>
>
> On 2021/4/2 4:46 PM, Ritesh Harjani wrote:
>> On 21/04/02 04:36PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/2 4:33 PM, Ritesh Harjani wrote:
>>>> On 21/03/29 10:01AM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/3/29 4:02 AM, Ritesh Harjani wrote:
>>>>>> On 21/03/25 09:16PM, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/3/25 8:20 PM, Neal Gompa wrote:
>>>>>>>> On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>>>>>>
>>>>>>>>> This patchset can be fetched from the following github repo,
>>>>>>>>> along with
>>>>>>>>> the full subpage RW support:
>>>>>>>>> https://github.com/adam900710/linux/tree/subpage
>>>>>>>>>
>>>>>>>>> This patchset is for metadata read write support.
>>>>>>>>>
>>>>>>>>> [FULL RW TEST]
>>>>>>>>> Since the data write path is not included in this patchset, we
>>>>>>>>> can't
>>>>>>>>> really test the patchset itself, but anyone can grab the patch
>>>>>>>>> from
>>>>>>>>> github repo and do fstests/generic tests.
>>>>>>>>>
>>>>>>>>> But at least the full RW patchset can pass -g generic/quick -x
>>>>>>>>> defrag
>>>>>>>>> for now.
>>>>>>>>>
>>>>>>>>> There are some known issues:
>>>>>>>>>
>>>>>>>>> - Defrag behavior change
>>>>>>>>>       Since current defrag is doing per-page defrag, to support
>>>>>>>>> subpage
>>>>>>>>>       defrag, we need some change in the loop.
>>>>>>>>>       E.g. if a page has both hole and regular extents in it,
>>>>>>>>> then defrag
>>>>>>>>>       will rewrite the full 64K page.
>>>>>>>>>
>>>>>>>>>       Thus for now, defrag related failure is expected.
>>>>>>>>>       But this should only cause behavior difference, no crash
>>>>>>>>> nor hang is
>>>>>>>>>       expected.
>>>>>>>>>
>>>>>>>>> - No compression support yet
>>>>>>>>>       There are at least 2 known bugs if forcing compression
>>>>>>>>> for subpage
>>>>>>>>>       * Some hard coded PAGE_SIZE screwing up space rsv
>>>>>>>>>       * Subpage ASSERT() triggered
>>>>>>>>>         This is because some compression code is unlocking
>>>>>>>>> locked_page by
>>>>>>>>>         calling extent_clear_unlock_delalloc() with locked_page
>>>>>>>>> == NULL.
>>>>>>>>>       So for now compression is also disabled.
>>>>>>>>>
>>>>>>>>> - Inode nbytes mismatch
>>>>>>>>>       Still debugging.
>>>>>>>>>       The fastest way to trigger is fsx using the following
>>>>>>>>> parameters:
>>>>>>>>>
>>>>>>>>>         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file
>>>>>>>>> > /tmp/fsx
>>>>>>>>>
>>>>>>>>>       Which would cause inode nbytes differs from expected
>>>>>>>>> value and
>>>>>>>>>       triggers btrfs check error.
>>>>>>>>>
>>>>>>>>> [DIFFERENCE AGAINST REGULAR SECTORSIZE]
>>>>>>>>> The metadata part in fact has more new code than data part, as
>>>>>>>>> it has
>>>>>>>>> some different behaviors compared to the regular sector size
>>>>>>>>> handling:
>>>>>>>>>
>>>>>>>>> - No more page locking
>>>>>>>>>       Now metadata read/write relies on extent io tree locking,
>>>>>>>>> other than
>>>>>>>>>       page locking.
>>>>>>>>>       This is to allow behaviors like read lock one eb while
>>>>>>>>> also try to
>>>>>>>>>       read lock another eb in the same page.
>>>>>>>>>       We can't rely on page lock as now we have multiple extent
>>>>>>>>> buffers in
>>>>>>>>>       the same page.
>>>>>>>>>
>>>>>>>>> - Page status update
>>>>>>>>>       Now we use subpage wrappers to handle page status update.
>>>>>>>>>
>>>>>>>>> - How to submit dirty extent buffers
>>>>>>>>>       Instead of just grabbing extent buffer from
>>>>>>>>> page::private, we need to
>>>>>>>>>       iterate all dirty extent buffers in the page and submit
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> [CHANGELOG]
>>>>>>>>> v2:
>>>>>>>>> - Rebased to latest misc-next
>>>>>>>>>       No conflicts at all.
>>>>>>>>>
>>>>>>>>> - Add new sysfs interface to grab supported RO/RW sectorsize
>>>>>>>>>       This will allow mkfs.btrfs to detect unmountable fs better.
>>>>>>>>>
>>>>>>>>> - Use newer naming schema for each patch
>>>>>>>>>       No more "extent_io:" or "inode:" schema anymore.
>>>>>>>>>
>>>>>>>>> - Move two pure cleanups to the series
>>>>>>>>>       Patch 2~3, originally in RW part.
>>>>>>>>>
>>>>>>>>> - Fix one uninitialized variable
>>>>>>>>>       Patch 6.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> - Rename the sysfs to supported_sectorsizes
>>>>>>>>>
>>>>>>>>> - Rebased to latest misc-next branch
>>>>>>>>>       This removes 2 cleanup patches.
>>>>>>>>>
>>>>>>>>> - Add new overview comment for subpage metadata
>>>>>>>>>
>>>>>>>>> Qu Wenruo (13):
>>>>>>>>>       btrfs: add sysfs interface for supported sectorsize
>>>>>>>>>       btrfs: use min() to replace open-code in
>>>>>>>>> btrfs_invalidatepage()
>>>>>>>>>       btrfs: remove unnecessary variable shadowing in
>>>>>>>>> btrfs_invalidatepage()
>>>>>>>>>       btrfs: refactor how we iterate ordered extent in
>>>>>>>>>         btrfs_invalidatepage()
>>>>>>>>>       btrfs: introduce helpers for subpage dirty status
>>>>>>>>>       btrfs: introduce helpers for subpage writeback status
>>>>>>>>>       btrfs: allow btree_set_page_dirty() to do more sanity
>>>>>>>>> check on subpage
>>>>>>>>>         metadata
>>>>>>>>>       btrfs: support subpage metadata csum calculation at write
>>>>>>>>> time
>>>>>>>>>       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>>>>>>>>>       btrfs: make the page uptodate assert to be subpage
>>>>>>>>> compatible
>>>>>>>>>       btrfs: make set/clear_extent_buffer_dirty() to be subpage
>>>>>>>>> compatible
>>>>>>>>>       btrfs: make set_btree_ioerr() accept extent buffer and to
>>>>>>>>> be subpage
>>>>>>>>>         compatible
>>>>>>>>>       btrfs: add subpage overview comments
>>>>>>>>>
>>>>>>>>>      fs/btrfs/disk-io.c   | 143
>>>>>>>>> ++++++++++++++++++++++++++++++++++---------
>>>>>>>>>      fs/btrfs/extent_io.c | 127
>>>>>>>>> ++++++++++++++++++++++++++++----------
>>>>>>>>>      fs/btrfs/inode.c     | 128
>>>>>>>>> ++++++++++++++++++++++----------------
>>>>>>>>>      fs/btrfs/subpage.c   | 127
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++
>>>>>>>>>      fs/btrfs/subpage.h   |  17 +++++
>>>>>>>>>      fs/btrfs/sysfs.c     |  15 +++++
>>>>>>>>>      6 files changed, 441 insertions(+), 116 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.30.1
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why wouldn't we just integrate full read-write support with the
>>>>>>>> caveats as described now? It seems to be relatively reasonable
>>>>>>>> to do
>>>>>>>> that, and this patch set is essentially unusable without the
>>>>>>>> rest of
>>>>>>>> it that does enable full read-write support.
>>>>>>>
>>>>>>> The metadata part is much more stable than data path (almost not
>>>>>>> touched
>>>>>>> for several months), and the metadata part already has some
>>>>>>> difference
>>>>>>> in its behavior, which needs review.
>>>>>>>
>>>>>>> Your point makes some sense, but I still don't believe pushing a
>>>>>>> super
>>>>>>> large patchset helps the review at all.
>>>>>>>
>>>>>>> If you want to test, you can grab the branch from the github repo.
>>>>>>> If you want to review, the mails are all here for review.
>>>>>>>
>>>>>>> In fact, we used to have subpage support sent as a big patchset
>>>>>>> from IBM
>>>>>>> guys, but the result is only some preparation patches get merged,
>>>>>>> and
>>>>>>> nothing more.
>>>>>>>
>>>>>>> Using this multi-series method, we're already doing better work and
>>>>>>> received more testing (to ensure regular sectorsize is not
>>>>>>> affected at
>>>>>>> least).
>>>>>>
>>>>>> Hi Qu Wenruo,
>>>>>>
>>>>>> Sorry about chiming in late on this. I don't have any strong
>>>>>> objection on either
>>>>>> approach. Although sometime back when I tested your RW support git
>>>>>> tree on
>>>>>> Power, the unmount patch itself was crashing. I didn't debug it
>>>>>> that time
>>>>>> (this was a month back or so), so I also didn't bother testing
>>>>>> xfstests on Power.
>>>>>>
>>>>>> But we do have an interest in making sure this patch series work
>>>>>> on bs < ps
>>>>>> on Power platform. I can try helping with testing, reviewing (to
>>>>>> best of my
>>>>>> knowledge) and fixing anything is possible :)
>>>>>
>>>>> That's great!
>>>>>
>>>>> One of my biggest problem here is, I don't have good enough testing
>>>>> environment.
>>>>>
>>>>> Although SUSE has internal clouds for ARM64/PPC64, due to the
>>>>> f**king Great Firewall, it's super slow to access, not to mention doing
>>>>> proper debugging.
>>>>>
>>>>> Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the
>>>>> test.
>>>>> But their computing power is far from ideal, only generic/quick can
>>>>> finish in hours.
>>>>>
>>>>> Thus real world Power could definitely help.
>>>>>>
>>>>>> Let me try and pull your tree and test it on Power. Please let me
>>>>>> know if there
>>>>>> is anything needs to be taken care apart from your github tree and
>>>>>> btrfs-progs
>>>>>> branch with bs < ps support.
>>>>>
>>>>> If you're going to test the branch, here are some small notes:
>>>>>
>>>>> - Need to use latest btrfs-progs
>>>>>     As it fixes a false alert on crossing 64K page boundary.
>>>>>
>>>>> - Need to slightly modify btrfs-progs to avoid false alerts
>>>>>     For subpage case, mkfs.btrfs will output a warning, but that
>>>>> warning
>>>>>     is outputted into stderr, which will screw up generic test groups.
>>>>>     It's recommended to apply the following diff:
>>>>>
>>>>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>>>>> index 569208a9..21976554 100644
>>>>> --- a/common/fsfeatures.c
>>>>> +++ b/common/fsfeatures.c
>>>>> @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
>>>>>                   return -EINVAL;
>>>>>           }
>>>>>           if (page_size != sectorsize)
>>>>> -               warning(
>>>>> -"the filesystem may not be mountable, sectorsize %u doesn't match page size %u",
>>>>> +               printf(
>>>>> +"the filesystem may not be mountable, sectorsize %u doesn't match page size %u\n",
>>>>>                           sectorsize, page_size);
>>>>>           return 0;
>>>>>    }
>>>>>
>>>>> - Xfstest/btrfs group will crash at btrfs/143
>>>>>     Still investigating, but you can ignore btrfs group for now.
>>>>>
>>>>> - Very rare hang
>>>>>     There is a very low chance of a hang, with a "bad ordered accounting"
>>>>>     message in dmesg.
>>>>>     If you can hit it, please let me know.
>>>>>     I have an idea how to fix it, but the fix is not yet in the branch.
>>>>>
>>>>> - btrfs inode nbytes mismatch
>>>>>     Investigating, as it will make btrfs-check to report error.
>>>>>
>>>>> The last two bugs are the final show blocker, I'll give you extra
>>>>> updates when those are fixed.
>>>>
>>>> Thanks Qu Wenruo, for above info.
>>>> I cloned below git tree as mentioned in your git log to test for RW
>>>> on Power.
>>>> However, I still see that RW mount for bs < ps is disabled in
>>>> open_ctree()
>>>> https://github.com/adam900710/linux/tree/subpage
>>>>
>>>> I see below code present in this tree.
>>>>            /* For 4K sector size support, it's only read-only */
>>>>            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
>>>>                    if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
>>>>                            btrfs_err(fs_info,
>>>>            "subpage sectorsize %u only supported read-only for page size %lu",
>>>>                                    sectorsize, PAGE_SIZE);
>>>>                            err = -EINVAL;
>>>>                            goto fail_alloc;
>>>>                    }
>>>>            }
>>>>
>>>> Could you pls point me to the tree I can use for bs < ps testing on
>>>> Power?
>>>> Sorry if I missed something.
>>>
>>> Sorry, I updated the branch to my current development progress, it's now
>>> at the ordered extent rework part, without the remaining subpage
>>> functionality at all.
>>>
>>> You may want to grab this tree instead:
>>> https://github.com/adam900710/linux/tree/subpage_old
>>>
>>> But please keep in mind that, you may get random hang, and certain
>>> generic test case, especially generic/075 can corrupt the inode nbytes
>>> and leaving all later test cases using TEST_DEV to report error on fsck.
>>>
>>
>> Thanks for quick response. Sure, I will exclude generic/075 from the test
>> for now.
>
> Not only generic/075, but all tests running fsx may cause inode nbytes
> corruption.
>
> Thus I'd recommend either modify btrfs-check to ignore it, or re-mkfs on
> TEST_DEV after each test case.

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
There are also no more random crashes during the tests.

As for btrfs/143, it will no longer trigger a BUG_ON(), although at the
cost of worse granularity for repair.
(Now it's per-bvec repair, not yet fully per-sector repair.)

I'll rebase the branch onto the latest misc-next in the coming days, but the
current branch is already good enough for full subpage RW support.

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>> -ritesh
>>

Ritesh Harjani April 15, 2021, 3:44 a.m. UTC | #22

On 21/04/12 07:33PM, Qu Wenruo wrote:
>
>
> On 2021/4/2 下午4:52, Qu Wenruo wrote:
> >
> >
> > On 2021/4/2 下午4:46, Ritesh Harjani wrote:
> > > On 21/04/02 04:36PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/2 下午4:33, Ritesh Harjani wrote:
> > > > > On 21/03/29 10:01AM, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/3/29 上午4:02, Ritesh Harjani wrote:
> > > > > > > On 21/03/25 09:16PM, Qu Wenruo wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2021/3/25 下午8:20, Neal Gompa wrote:
> > > > > > > > > On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <wqu@suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > This patchset can be fetched from the following github repo,
> > > > > > > > > > along with
> > > > > > > > > > the full subpage RW support:
> > > > > > > > > > https://github.com/adam900710/linux/tree/subpage
> > > > > > > > > >
> > > > > > > > > > This patchset is for metadata read write support.
> > > > > > > > > >
> > > > > > > > > > [FULL RW TEST]
> > > > > > > > > > Since the data write path is not included in this patchset, we
> > > > > > > > > > can't
> > > > > > > > > > really test the patchset itself, but anyone can grab the patch
> > > > > > > > > > from
> > > > > > > > > > github repo and do fstests/generic tests.
> > > > > > > > > >
> > > > > > > > > > But at least the full RW patchset can pass -g generic/quick -x
> > > > > > > > > > defrag
> > > > > > > > > > for now.
> > > > > > > > > >
> > > > > > > > > > There are some known issues:
> > > > > > > > > >
> > > > > > > > > > - Defrag behavior change
> > > > > > > > > >       Since current defrag is doing per-page defrag, to support
> > > > > > > > > > subpage
> > > > > > > > > >       defrag, we need some change in the loop.
> > > > > > > > > >       E.g. if a page has both hole and regular extents in it,
> > > > > > > > > > then defrag
> > > > > > > > > >       will rewrite the full 64K page.
> > > > > > > > > >
> > > > > > > > > >       Thus for now, defrag related failure is expected.
> > > > > > > > > >       But this should only cause behavior difference, no crash
> > > > > > > > > > nor hang is
> > > > > > > > > >       expected.
> > > > > > > > > >
> > > > > > > > > > - No compression support yet
> > > > > > > > > >       There are at least 2 known bugs if forcing compression
> > > > > > > > > > for subpage
> > > > > > > > > >       * Some hard coded PAGE_SIZE screwing up space rsv
> > > > > > > > > >       * Subpage ASSERT() triggered
> > > > > > > > > >         This is because some compression code is unlocking
> > > > > > > > > > locked_page by
> > > > > > > > > >         calling extent_clear_unlock_delalloc() with locked_page
> > > > > > > > > > == NULL.
> > > > > > > > > >       So for now compression is also disabled.
> > > > > > > > > >
> > > > > > > > > > - Inode nbytes mismatch
> > > > > > > > > >       Still debugging.
> > > > > > > > > >       The fastest way to trigger is fsx using the following
> > > > > > > > > > parameters:
> > > > > > > > > >
> > > > > > > > > >         fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file
> > > > > > > > > > > /tmp/fsx
> > > > > > > > > >
> > > > > > > > > >       Which would cause inode nbytes differs from expected
> > > > > > > > > > value and
> > > > > > > > > >       triggers btrfs check error.
> > > > > > > > > >
> > > > > > > > > > [DIFFERENCE AGAINST REGULAR SECTORSIZE]
> > > > > > > > > > The metadata part in fact has more new code than data part, as
> > > > > > > > > > > it has
> > > > > > > > > > some different behaviors compared to the regular sector size
> > > > > > > > > > handling:
> > > > > > > > > >
> > > > > > > > > > - No more page locking
> > > > > > > > > >       Now metadata read/write relies on extent io tree locking,
> > > > > > > > > > other than
> > > > > > > > > >       page locking.
> > > > > > > > > >       This is to allow behaviors like read lock one eb while
> > > > > > > > > > > also try to
> > > > > > > > > >       read lock another eb in the same page.
> > > > > > > > > >       We can't rely on page lock as now we have multiple extent
> > > > > > > > > > buffers in
> > > > > > > > > >       the same page.
> > > > > > > > > >
> > > > > > > > > > - Page status update
> > > > > > > > > >       Now we use subpage wrappers to handle page status update.
> > > > > > > > > >
> > > > > > > > > > - How to submit dirty extent buffers
> > > > > > > > > >       Instead of just grabbing extent buffer from
> > > > > > > > > > page::private, we need to
> > > > > > > > > >       iterate all dirty extent buffers in the page and submit
> > > > > > > > > > them.
> > > > > > > > > >
> > > > > > > > > > [CHANGELOG]
> > > > > > > > > > v2:
> > > > > > > > > > - Rebased to latest misc-next
> > > > > > > > > >       No conflicts at all.
> > > > > > > > > >
> > > > > > > > > > - Add new sysfs interface to grab supported RO/RW sectorsize
> > > > > > > > > >       This will allow mkfs.btrfs to detect unmountable fs better.
> > > > > > > > > >
> > > > > > > > > > - Use newer naming schema for each patch
> > > > > > > > > >       No more "extent_io:" or "inode:" schema anymore.
> > > > > > > > > >
> > > > > > > > > > - Move two pure cleanups to the series
> > > > > > > > > >       Patch 2~3, originally in RW part.
> > > > > > > > > >
> > > > > > > > > > - Fix one uninitialized variable
> > > > > > > > > >       Patch 6.
> > > > > > > > > >
> > > > > > > > > > v3:
> > > > > > > > > > - Rename the sysfs to supported_sectorsizes
> > > > > > > > > >
> > > > > > > > > > - Rebased to latest misc-next branch
> > > > > > > > > >       This removes 2 cleanup patches.
> > > > > > > > > >
> > > > > > > > > > - Add new overview comment for subpage metadata
> > > > > > > > > >
> > > > > > > > > > Qu Wenruo (13):
> > > > > > > > > >       btrfs: add sysfs interface for supported sectorsize
> > > > > > > > > >       btrfs: use min() to replace open-code in
> > > > > > > > > > btrfs_invalidatepage()
> > > > > > > > > >       btrfs: remove unnecessary variable shadowing in
> > > > > > > > > > btrfs_invalidatepage()
> > > > > > > > > >       btrfs: refactor how we iterate ordered extent in
> > > > > > > > > >         btrfs_invalidatepage()
> > > > > > > > > >       btrfs: introduce helpers for subpage dirty status
> > > > > > > > > >       btrfs: introduce helpers for subpage writeback status
> > > > > > > > > >       btrfs: allow btree_set_page_dirty() to do more sanity
> > > > > > > > > > > check on subpage
> > > > > > > > > >         metadata
> > > > > > > > > >       btrfs: support subpage metadata csum calculation at write
> > > > > > > > > > time
> > > > > > > > > >       btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> > > > > > > > > >       btrfs: make the page uptodate assert to be subpage
> > > > > > > > > > compatible
> > > > > > > > > >       btrfs: make set/clear_extent_buffer_dirty() to be subpage
> > > > > > > > > > compatible
> > > > > > > > > >       btrfs: make set_btree_ioerr() accept extent buffer and to
> > > > > > > > > > be subpage
> > > > > > > > > >         compatible
> > > > > > > > > >       btrfs: add subpage overview comments
> > > > > > > > > >
> > > > > > > > > >      fs/btrfs/disk-io.c   | 143
> > > > > > > > > > ++++++++++++++++++++++++++++++++++---------
> > > > > > > > > >      fs/btrfs/extent_io.c | 127
> > > > > > > > > > ++++++++++++++++++++++++++++----------
> > > > > > > > > >      fs/btrfs/inode.c     | 128
> > > > > > > > > > ++++++++++++++++++++++----------------
> > > > > > > > > >      fs/btrfs/subpage.c   | 127
> > > > > > > > > > ++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >      fs/btrfs/subpage.h   |  17 +++++
> > > > > > > > > >      fs/btrfs/sysfs.c     |  15 +++++
> > > > > > > > > >      6 files changed, 441 insertions(+), 116 deletions(-)
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.30.1
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Why wouldn't we just integrate full read-write support with the
> > > > > > > > > caveats as described now? It seems to be relatively reasonable
> > > > > > > > > to do
> > > > > > > > > that, and this patch set is essentially unusable without the
> > > > > > > > > rest of
> > > > > > > > > it that does enable full read-write support.
> > > > > > > >
> > > > > > > > The metadata part is much more stable than data path (almost not
> > > > > > > > touched
> > > > > > > > for several months), and the metadata part already has some
> > > > > > > > difference
> > > > > > > > in its behavior, which needs review.
> > > > > > > >
> > > > > > > > You point makes some sense, but I still don't believe pushing a
> > > > > > > > super
> > > > > > > > large patchset does any help for the review.
> > > > > > > >
> > > > > > > > If you want to test, you can grab the branch from the github repo.
> > > > > > > > If you want to review, the mails are all here for review.
> > > > > > > >
> > > > > > > > In fact, we used to have subpage support sent as a big patchset
> > > > > > > > from IBM
> > > > > > > > guys, but the result is only some preparation patches get merged,
> > > > > > > > and
> > > > > > > > nothing more.
> > > > > > > >
> > > > > > > > Using this multi-series method, we're already doing better work and
> > > > > > > > received more testing (to ensure regular sectorsize is not
> > > > > > > > > affected at
> > > > > > > > least).
> > > > > > >
> > > > > > > Hi Qu Wenruo,
> > > > > > >
> > > > > > > Sorry about chiming in late on this. I don't have any strong
> > > > > > > objection on either
> > > > > > > approach. Although sometime back when I tested your RW support git
> > > > > > > tree on
> > > > > > > Power, the unmount patch itself was crashing. I didn't debug it
> > > > > > > > that time
> > > > > > > (this was a month back or so), so I also didn't bother testing
> > > > > > > xfstests on Power.
> > > > > > >
> > > > > > > But we do have an interest in making sure this patch series work
> > > > > > > on bs < ps
> > > > > > > on Power platform. I can try helping with testing, reviewing (to
> > > > > > > best of my
> > > > > > > knowledge) and fixing anything is possible :)
> > > > > >
> > > > > > That's great!
> > > > > >
> > > > > > > One of my biggest problems here is that I don't have a good enough
> > > > > > > testing environment.
> > > > > > >
> > > > > > > Although SUSE has internal clouds for ARM64/PPC64, due to the
> > > > > > > f**king Great Firewall, it's super slow to access, not to mention doing
> > > > > > > proper debugging.
> > > > > >
> > > > > > Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the
> > > > > > test.
> > > > > > But their computing power is far from ideal, only generic/quick can
> > > > > > finish in hours.
> > > > > >
> > > > > > Thus real world Power could definitely help.
> > > > > > >
> > > > > > > Let me try and pull your tree and test it on Power. Please let me
> > > > > > > know if there
> > > > > > > is anything needs to be taken care apart from your github tree and
> > > > > > > btrfs-progs
> > > > > > > branch with bs < ps support.
> > > > > >
> > > > > > If you're going to test the branch, here are some small notes:
> > > > > >
> > > > > > - Need to use latest btrfs-progs
> > > > > >     As it fixes a false alert on crossing 64K page boundary.
> > > > > >
> > > > > > - Need to slightly modify btrfs-progs to avoid false alerts
> > > > > >     For subpage case, mkfs.btrfs will output a warning, but that
> > > > > > warning
> > > > > >     is outputted into stderr, which will screw up generic test groups.
> > > > > >     It's recommended to apply the following diff:
> > > > > >
> > > > > > diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> > > > > > index 569208a9..21976554 100644
> > > > > > --- a/common/fsfeatures.c
> > > > > > +++ b/common/fsfeatures.c
> > > > > > @@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
> > > > > >                   return -EINVAL;
> > > > > >           }
> > > > > >           if (page_size != sectorsize)
> > > > > > -               warning(
> > > > > > -"the filesystem may not be mountable, sectorsize %u doesn't match
> > > > > > page
> > > > > > size %u",
> > > > > > +               printf(
> > > > > > +"the filesystem may not be mountable, sectorsize %u doesn't match
> > > > > > page
> > > > > > size %u\n",
> > > > > >                           sectorsize, page_size);
> > > > > >           return 0;
> > > > > >    }
> > > > > >
> > > > > > - Xfstest/btrfs group will crash at btrfs/143
> > > > > >     Still investigating, but you can ignore btrfs group for now.
> > > > > >
> > > > > > - Very rare hang
> > > > > >     There is a very low change to hang, with "bad ordered accounting"
> > > > > >     dmesg.
> > > > > >     If you can hit, please let me know.
> > > > > >     I had something idea to fix it, but not yet in the branch.
> > > > > >
> > > > > > - btrfs inode nbytes mismatch
> > > > > >     Investigating, as it will make btrfs-check to report error.
> > > > > >
> > > > > > The last two bugs are the final show blocker, I'll give you extra
> > > > > > updates when those are fixed.
> > > > >
> > > > > Thanks Qu Wenruo, for above info.
> > > > > I cloned below git tree as mentioned in your git log to test for RW
> > > > > > on Power.
> > > > > > However, I still see that RW mount for bs < ps is disabled in
> > > > > > open_ctree()
> > > > > https://github.com/adam900710/linux/tree/subpage
> > > > >
> > > > > I see below code present in this tree.
> > > > >            /* For 4K sector size support, it's only read-only */
> > > > >            if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
> > > > >                    if (!sb_rdonly(sb) ||
> > > > > btrfs_super_log_root(disk_super)) {
> > > > >                            btrfs_err(fs_info,
> > > > >            "subpage sectorsize %u only supported read-only for page
> > > > > size %lu",
> > > > >                                    sectorsize, PAGE_SIZE);
> > > > >                            err = -EINVAL;
> > > > >                            goto fail_alloc;
> > > > >                    }
> > > > >            }
> > > > >
> > > > > Could you pls point me to the tree I can use for bs < ps testing on
> > > > > Power?
> > > > > Sorry if I missed something.
> > > >
> > > > Sorry, I updated the branch to my current development progress, it's now
> > > > at the ordered extent rework part, without the remaining subpage
> > > > functionality at all.
> > > >
> > > > You may want to grab this tree instead:
> > > > https://github.com/adam900710/linux/tree/subpage_old
> > > >
> > > > But please keep in mind that, you may get random hang, and certain
> > > > generic test case, especially generic/075 can corrupt the inode nbytes
> > > > and leaving all later test cases using TEST_DEV to report error on fsck.
> > > >
> > >
> > > Thanks for quick response. Sure, I will exclude generic/075 from the test
> > > for now.
> >
> > Not only generic/075, but all tests running fsx may cause inode nbytes
> > corruption.
> >
> > Thus I'd recommend either modify btrfs-check to ignore it, or re-mkfs on
> > TEST_DEV after each test case.
>
> Good news, you can fetch the subpage branch for better test results.
>
> Now the branch should pass all generic tests, except defrag and known
> failures.
> And no more random crash during the tests.

Thanks, let me test it on PPC64 box.

-ritesh

>
> And for btrfs/143, it will no longer trigger a BUG_ON(), although at the
> cost of worse granularity for repair.
> (Now it's per-bvec repair, not yet fully per-sector repair).
>
> I'll rebase the branch onto the latest misc-next in the coming days, but the
> current branch is already good enough for full subpage RW support.
>
> Thanks,
> Qu
> >
> > Thanks,
> > Qu
> >
> > >
> > > -ritesh
> > >

Ritesh Harjani April 15, 2021, 2:52 p.m. UTC | #23

On 21/04/15 09:14AM, riteshh wrote:
> On 21/04/12 07:33PM, Qu Wenruo wrote:
> > Good news, you can fetch the subpage branch for better test results.
> >
> > Now the branch should pass all generic tests, except defrag and known
> > failures.
> > And no more random crash during the tests.
>
> Thanks, let me test it on PPC64 box.

I do see some failures remaining with the patch series.
However, the one which is blocking my testing is tests/generic/095,
where I hit a kernel BUG with the signature below.

Please let me know if this is a known failure.

<xfstests config>
#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION       -- btrfs_4k
FSTYP         -- btrfs
PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch


<kernel logs>
[ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
[ 6061.378296] ------------[ cut here ]------------
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
    pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
    lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
    sp: c0000000260d7730
   msr: 800000000282b033
  current = 0xc0000000260c0080
  paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
    pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c0000000260d7790] c000000000a90280 btrfs_subpage_assert.isra.9+0x70/0x110
[c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
[c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
[c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
[c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
[c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
[c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
[c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
[c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
[c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
[c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
[c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
[c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
--- Exception: c00 (System Call) at 00007ffff72ef170


-ritesh

Qu Wenruo April 15, 2021, 11:19 p.m. UTC | #24

On 2021/4/15 下午10:52, riteshh wrote:
> On 21/04/15 09:14AM, riteshh wrote:
>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>> Good news, you can fetch the subpage branch for better test results.
>>>
>>> Now the branch should pass all generic tests, except defrag and known
>>> failures.
>>> And no more random crash during the tests.
>>
>> Thanks, let me test it on PPC64 box.
>
> I do see some failures remaining with the patch series.
> However the one which is blocking my testing is the tests/generic/095
> I see kernel BUG hitting with below signature.

That's pretty different from my test results, as I haven't seen such a
BUG_ON() for a while.


>
> Please let me know if this is a known failure.
>
> <xfstests config>
> #:~/work-tools/xfstests$ sudo ./check -g auto
> SECTION       -- btrfs_4k
> FSTYP         -- btrfs
> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3

I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, nodesize doesn't look like the cause, as the crash
happens in the data path, which means it's only related to sectorsize.
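
A minimal sketch of the arithmetic behind this reasoning, assuming a 64K
page as on the reported ppc64le setup. The helper names
sectors_per_page() and tree_blocks_per_page() are hypothetical and are
not kernel functions; they only show that the data side depends on
sectorsize (-s) while only the metadata side depends on nodesize (-n).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many data sectors one page holds.
 * Only sectorsize (mkfs.btrfs -s) enters this calculation.
 */
static uint32_t sectors_per_page(uint32_t page_size, uint32_t sectorsize)
{
	assert(sectorsize != 0 && page_size % sectorsize == 0);
	return page_size / sectorsize;
}

/*
 * Hypothetical helper: how many tree blocks fit in one page.
 * Only nodesize (mkfs.btrfs -n) enters this calculation.
 * Returns 0 when a tree block is larger than a page.
 */
static uint32_t tree_blocks_per_page(uint32_t page_size, uint32_t nodesize)
{
	assert(nodesize != 0);
	return page_size / nodesize;
}
```

With a 64K page, "-s 4096" gives 16 data sectors per page whether the
filesystem uses "-n 4096" or "-n 16K"; the nodesize only changes how
many tree blocks share a metadata page (16 versus 4).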

> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>
>
> <kernel logs>
> [ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> [ 6058.348910] BTRFS info (device loop2): has skinny extents
> [ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> [ 6060.226213] BTRFS info (device loop3): has skinny extents
> [ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> [ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
> [ 6061.378296] ------------[ cut here ]------------
> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>      sp: c0000000260d7730
>     msr: 800000000282b033
>    current = 0xc0000000260c0080
>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>      pid   = 739712, comm = fio
> kernel BUG at fs/btrfs/ctree.h:3403!
> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> enter ? for help
> [c0000000260d7790] c000000000a90280 btrfs_subpage_assert.isra.9+0x70/0x110
> [c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0

This is very strange.
In btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.
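
To make the failing check concrete, here is a small user-space model of
it, written as a sketch under stated assumptions rather than as the
kernel code. The assertion text "PagePrivate(page) && page->private"
comes from the log above; everything else (struct model_page, struct
model_subpage, the model_* helpers, and a 16-bit uptodate bitmap with
one bit per 4K sector of a 64K page) is assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE  65536u /* assumed 64K page */
#define MODEL_SECTORSIZE 4096u  /* assumed 4K sector */

/* Assumed per-page subpage state: one uptodate bit per sector. */
struct model_subpage {
	uint16_t uptodate_bitmap;
};

/* Stand-in for struct page: the Private flag plus the private pointer. */
struct model_page {
	bool private_flag;
	struct model_subpage *private_data;
};

/* Models the condition from the log: PagePrivate(page) && page->private. */
static bool model_subpage_attached(const struct model_page *page)
{
	return page->private_flag && page->private_data != NULL;
}

/* Bitmap covering [start, start + len) within one page, sector aligned. */
static uint16_t model_calc_bitmap(uint32_t start, uint32_t len)
{
	uint32_t bit_start = (start % MODEL_PAGE_SIZE) / MODEL_SECTORSIZE;
	uint32_t nbits = len / MODEL_SECTORSIZE;

	assert(start % MODEL_SECTORSIZE == 0 && len % MODEL_SECTORSIZE == 0);
	assert(bit_start + nbits <= MODEL_PAGE_SIZE / MODEL_SECTORSIZE);
	return (uint16_t)(((1u << nbits) - 1u) << bit_start);
}

/*
 * Marks a range uptodate. Returns false where the kernel ASSERT() would
 * fire, i.e. when the page has no subpage state attached.
 */
static bool model_set_uptodate(struct model_page *page,
			       uint32_t start, uint32_t len)
{
	if (!model_subpage_attached(page))
		return false;
	page->private_data->uptodate_bitmap |= model_calc_bitmap(start, len);
	return true;
}
```

In this model, a page that went through preparation has both the flag
and the pointer set, so model_set_uptodate() succeeds; the reported
crash corresponds to the false branch, where a page reaches the
uptodate update without its private state attached.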

Can you reproduce the bug reliably?

BTW, are you running the latest branch, with this commit at top?

commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon Feb 22 14:19:38 2021 +0800

     btrfs: allow read-write for 4K sectorsize on 64K page size systems

I ask because I was updating the patchset until the last minute.

Thanks,
Qu

> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> --- Exception: c00 (System Call) at 00007ffff72ef170
>
>
> -ritesh
>

Qu Wenruo April 15, 2021, 11:34 p.m. UTC | #25

On 2021/4/16 上午7:19, Qu Wenruo wrote:
>
>
> On 2021/4/15 下午10:52, riteshh wrote:
>> On 21/04/15 09:14AM, riteshh wrote:
>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>> Good news, you can fetch the subpage branch for better test results.
>>>>
>>>> Now the branch should pass all generic tests, except defrag and known
>>>> failures.
>>>> And no more random crash during the tests.
>>>
>>> Thanks, let me test it on PPC64 box.
>>
>> I do see some failures remaining with the patch series.
>> However the one which is blocking my testing is the tests/generic/095
>> I see kernel BUG hitting with below signature.
>
> That's pretty different from my tests.
>
> As I haven't seen such BUG_ON() for a while.
>
>
>>
>> Please let me know if this is a known failure.
>>
>> <xfstests config>
>> #:~/work-tools/xfstests$ sudo ./check -g auto
>> SECTION       -- btrfs_4k
>> FSTYP         -- btrfs
>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>
> I see you're using -n 4096, not the default -n 16K, let me see if I can
> reproduce that.
>
> But from the backtrace, nodesize doesn't look like the cause, as the crash
> happens in the data path, which means it's only related to sectorsize.
>
>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>
>>
>> <kernel logs>
>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
>> [ 6061.378296] ------------[ cut here ]------------
>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>      sp: c0000000260d7730
>>     msr: 800000000282b033
>>    current = 0xc0000000260c0080
>>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>      pid   = 739712, comm = fio
>> kernel BUG at fs/btrfs/ctree.h:3403!
>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>> enter ? for help
>> [c0000000260d7790] c000000000a90280 btrfs_subpage_assert.isra.9+0x70/0x110
>> [c0000000260d77b0] c000000000a91064 btrfs_subpage_set_uptodate+0x54/0x110
>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>
> This is very strange.
> As in btrfs_dirty_pages(), the pages passed in are already prepared by
> prepare_pages(), which means all of them should have Private set.
>
> Can you reproduce the bug reliably?

OK, I got it reproduced.

It's not a reliable BUG_ON(), but it can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
never triggered it in any of my previous runs.

I'll take a look into the case.

Thanks for the report,
Qu
>
> BTW, are you running the latest branch, with this commit at top?
>
> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon Feb 22 14:19:38 2021 +0800
>
>     btrfs: allow read-write for 4K sectorsize on 64K page size systems
>
> As I was updating the patchset until the last minute.
>
> Thanks,
> Qu
>
>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>
>>
>> -ritesh
>>

Qu Wenruo April 16, 2021, 1:34 a.m. UTC | #26

On 2021/4/16 7:34 AM, Qu Wenruo wrote:
>
>
> On 2021/4/16 7:19 AM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/15 10:52 PM, riteshh wrote:
>>> On 21/04/15 09:14AM, riteshh wrote:
>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>
>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>> failures.
>>>>> And no more random crash during the tests.
>>>>
>>>> Thanks, let me test it on PPC64 box.
>>>
>>> I do see some failures remaining with the patch series.
>>> However the one which is blocking my testing is the tests/generic/095
>>> I see kernel BUG hitting with below signature.
>>
>> That's pretty different from my tests.
>>
>> As I haven't seen such BUG_ON() for a while.
>>
>>
>>>
>>> Please let me know if this a known failure?
>>>
>>> <xfstests config>
>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>> SECTION       -- btrfs_4k
>>> FSTYP         -- btrfs
>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>
>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>> reproduce that.
>>
>> But from the backtrace, it doesn't look like the case,
>> as it happens for data path, which means it's only related to sectorsize.
>>
>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>
>>>
>>> <kernel logs>
>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>> size 4096 with page size 65536 is experimental
>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>> in fs/btrfs/subpage.c:171
>>> [ 6061.378296] ------------[ cut here ]------------
>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>      sp: c0000000260d7730
>>>     msr: 800000000282b033
>>>    current = 0xc0000000260c0080
>>>    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>      pid   = 739712, comm = fio
>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>> enter ? for help
>>> [c0000000260d7790] c000000000a90280
>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>> [c0000000260d77b0] c000000000a91064
>>> btrfs_subpage_set_uptodate+0x54/0x110
>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>
>> This is very strange.
>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>> prepare_pages(), which means all of them should have Private set.
>>
>> Can you reproduce the bug reliable?
>
> OK, I got it reproduced.
>
> It's not a reliable BUG_ON(), but can be reproduced.
> The test get skipped for all my boards as it requires fio tool, thus I
> didn't get it triggered for all previous runs.
>
> I'll take a look into the case.

This exposed an interesting race window in btrfs_buffered_write():

        Writer                    |             fadvise
----------------------------------+-------------------------------
btrfs_buffered_write()            |
|- prepare_pages()                |
|  |- Now all pages involved get  |
|     Private set                 |
|                                 | btrfs_releasepage()
|                                 | |- Clear page Private
|- lock_extent()                  |
|  |- From here on,               |
|     btrfs_releasepage() can no  |
|     longer clear page Private   |
|                                 |
|- btrfs_dirty_pages()            |
   |- Will trigger the BUG_ON()   |

This only triggers for subpage, because subpage introduces a new ASSERT()
as an extra check.

Strictly speaking, the regular sector size case has the same problem.
But the regular sector size case doesn't really care about page Private,
as it just sets page->private to a constant value, unlike the subpage case
which stores important data there.

The fix just re-sets page Private and the needed structures in
btrfs_dirty_pages(), with the extent locked, so btrfs_releasepage() can
no longer release them.
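
To make the ordering above easier to follow, here is a minimal userspace
model of the race and of the fix. It is only a sketch: struct model_page
and every model_*() function are hypothetical names made up for this
illustration, not kernel code, and the real fix touches the actual
subpage structures rather than a single boolean.

```c
#include <stdbool.h>

/* Userspace model only; none of these types are the kernel's. */
struct model_page {
	bool has_private;   /* stands in for PagePrivate(page) && page->private */
	bool extent_locked; /* stands in for a locked extent range over the page */
};

/* Models prepare_pages(): attach the private state to the page. */
static void model_prepare_page(struct model_page *p)
{
	p->has_private = true;
}

/* Models lock_extent() on the range covering the page. */
static void model_lock_extent(struct model_page *p)
{
	p->extent_locked = true;
}

/*
 * Models the release path: it refuses to touch the page while the extent
 * is locked, otherwise it drops the private state.
 * Returns true if the private state was released.
 */
static bool model_release_page(struct model_page *p)
{
	if (p->extent_locked)
		return false;
	p->has_private = false;
	return true;
}

/*
 * Models the dirty step before the fix: it assumes the private state is
 * still attached. Returns false where the subpage ASSERT() would fire.
 */
static bool model_dirty_page_unfixed(const struct model_page *p)
{
	return p->has_private;
}

/*
 * Models the dirty step with the fix: under the extent lock, re-attach
 * the private state if a release slipped in before lock_extent().
 */
static bool model_dirty_page_fixed(struct model_page *p)
{
	if (!p->extent_locked)
		return false; /* the fix relies on the extent being locked */
	if (!p->has_private)
		p->has_private = true;
	return true;
}
```

Replaying the diagram (prepare, release, lock, dirty) makes the unfixed
dirty step fail, while the fixed one re-attaches the state, and any later
release attempt is rejected because the extent is already locked.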

The fix has already been added to the github branch, where it is now
the HEAD commit.

I hope this won't damage your confidence in the patchset.

Thanks for the report!
Qu

>
> Thanks for the report,
> Qu
>>
>> BTW, are using running the latest branch, with this commit at top?
>>
>> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
>> Author: Qu Wenruo <wqu@suse.com>
>> Date:   Mon Feb 22 14:19:38 2021 +0800
>>
>>     btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>
>> As I was updating the patchset until the last minute.
>>
>> Thanks,
>> Qu
>>
>>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>>
>>>
>>> -ritesh
>>>

Ritesh Harjani April 16, 2021, 5:50 a.m. UTC | #27

On 21/04/16 09:34AM, Qu Wenruo wrote:
>
>
> On 2021/4/16 7:34 AM, Qu Wenruo wrote:
> >
> >
> > On 2021/4/16 7:19 AM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/15 10:52 PM, riteshh wrote:
> > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > >
> > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > failures.
> > > > > > And no more random crash during the tests.
> > > > >
> > > > > Thanks, let me test it on PPC64 box.
> > > >
> > > > I do see some failures remaining with the patch series.
> > > > However the one which is blocking my testing is the tests/generic/095
> > > > I see kernel BUG hitting with below signature.
> > >
> > > That's pretty different from my tests.
> > >
> > > As I haven't seen such BUG_ON() for a while.
> > >
> > >
> > > >
> > > > Please let me know if this a known failure?
> > > >
> > > > <xfstests config>
> > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > SECTION       -- btrfs_4k
> > > > FSTYP         -- btrfs
> > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > >
> > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > reproduce that.
> > >
> > > But from the backtrace, it doesn't look like the case,
> > > as it happens for data path, which means it's only related to sectorsize.
> > >
> > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > >
> > > >
> > > > <kernel logs>
> > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > size 4096 with page size 65536 is experimental
> > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > in fs/btrfs/subpage.c:171
> > > > [ 6061.378296] ------------[ cut here ]------------
> > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > >      pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > >      lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > >      sp: c0000000260d7730
> > > >     msr: 800000000282b033
> > > >    current = 0xc0000000260c0080
> > > >    paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > >      pid   = 739712, comm = fio
> > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > enter ? for help
> > > > [c0000000260d7790] c000000000a90280
> > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > [c0000000260d77b0] c000000000a91064
> > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > >
> > > This is very strange.
> > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > prepare_pages(), which means all of them should have Private set.
> > >
> > > Can you reproduce the bug reliable?

Yes, almost reliably on my PPC box.

> >
> > OK, I got it reproduced.
> >
> > It's not a reliable BUG_ON(), but can be reproduced.
> > The test get skipped for all my boards as it requires fio tool, thus I
> > didn't get it triggered for all previous runs.
> >
> > I'll take a look into the case.
>
> This exposed an interesting race window in btrfs_buffered_write():
>         Writer                    |             fadvice
> ----------------------------------+-------------------------------
> btrfs_buffered_write()            |
> |- prepare_pages()                |
> |  |- Now all pages involved get  |
> |     Private set                 |
> |                                 | btrfs_release_page()
> |                                 | |- Clear page Private
> |- lock_extent()                  |
> |  |- This would prevent          |
> |     btrfs_release_page() to     |
> |     clear the page Private      |
> |
> |- btrfs_dirty_page()
>    |- Will trigger the BUG_ON()


Sorry about the silly query, but help me understand how the above race is
possible. Won't prepare_pages() lock all the pages first? Shouldn't
btrfs_releasepage() have the same locked-page requirement too?

I see only two paths which could lead to btrfs_releasepage():
1. one via try_to_release_page() -> releasepage()
2. the writeback path calling btrfs_writepage() or btrfs_writepages(),
	which may end up calling btrfs_invalidatepage()

Although I am not sure which one this is racing with.

>
> This only happens for subpage, because subpage introduces new ASSERT()
> to do extra check.
>
> If we want to speak strictly, regular sector size should also report
> this problem.
> But regular sector size case doesn't really care about page Private, as
> it just set page->private to a constant value, unlike subpage case which
> stores important value.
>
> The fix will just re-set page Private and needed structures in
> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> able to release it anymore.

With the above fix I see a different issue, with the signature below.

[  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
[  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
[  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
[  132.146892] BTRFS info (device loop3): disk space caching is enabled
[  132.147831] BTRFS info (device loop3): has skinny extents
[  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[  132.158228] BTRFS info (device loop3): checking UUID tree
[  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
[  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
[  133.934432] Faulting instruction address: 0xc000000000283654
cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
    pc: c000000000283654: spin_dump+0x70/0xbc
    lr: c000000000283638: spin_dump+0x54/0xbc
    sp: c000000007937400
   msr: 8000000000001033
   dar: 6b6b6b6b6b6b725b
  current = 0xc000000007913300
  paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
    pid   = 0, comm = swapper/4
Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
enter ? for help
[c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
[c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
[c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
[c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
[c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
[c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
[c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
[c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
[c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
[c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
[c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
[c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
[c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
[c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
[c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0




>
> The fix is already added to the github branch.
> Now it has the fix as the HEAD.
>
> I hope this won't damage your confidence on the patchset.
>
> Thanks for the report!
> Qu
>
> >
> > Thanks for the report,
> > Qu
> > >
> > > BTW, are using running the latest branch, with this commit at top?

Yes. Below branch.
https://github.com/adam900710/linux/commits/subpage

-ritesh

> > >
> > > commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> > > Author: Qu Wenruo <wqu@suse.com>
> > > Date:   Mon Feb 22 14:19:38 2021 +0800
> > >
> > >     btrfs: allow read-write for 4K sectorsize on 64K page size systems
> > >
> > > As I was updating the patchset until the last minute.
> > >
> > > Thanks,
> > > Qu
> > >
> > > > [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> > > > [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> > > > [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> > > > [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> > > > [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> > > > [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> > > > [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> > > > [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> > > > [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> > > > [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> > > > --- Exception: c00 (System Call) at 00007ffff72ef170
> > > >
> > > >
> > > > -ritesh
> > > >

Qu Wenruo April 16, 2021, 6:14 a.m. UTC | #28

On 2021/4/16 1:50 PM, riteshh wrote:
> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/16 7:34 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/16 7:19 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/4/15 10:52 PM, riteshh wrote:
>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>>>
>>>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>>>> failures.
>>>>>>> And no more random crash during the tests.
>>>>>>
>>>>>> Thanks, let me test it on PPC64 box.
>>>>>
>>>>> I do see some failures remaining with the patch series.
>>>>> However the one which is blocking my testing is the tests/generic/095
>>>>> I see kernel BUG hitting with below signature.
>>>>
>>>> That's pretty different from my tests.
>>>>
>>>> As I haven't seen such BUG_ON() for a while.
>>>>
>>>>
>>>>>
>>>>> Please let me know if this a known failure?
>>>>>
>>>>> <xfstests config>
>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>> SECTION       -- btrfs_4k
>>>>> FSTYP         -- btrfs
>>>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>
>>>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>>>> reproduce that.
>>>>
>>>> But from the backtrace, it doesn't look like the case,
>>>> as it happens for data path, which means it's only related to sectorsize.
>>>>
>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>
>>>>>
>>>>> <kernel logs>
>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>> size 4096 with page size 65536 is experimental
>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>>>> in fs/btrfs/subpage.c:171
>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>       sp: c0000000260d7730
>>>>>      msr: 800000000282b033
>>>>>     current = 0xc0000000260c0080
>>>>>     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>>>       pid   = 739712, comm = fio
>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>> enter ? for help
>>>>> [c0000000260d7790] c000000000a90280
>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>> [c0000000260d77b0] c000000000a91064
>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>
>>>> This is very strange.
>>>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>>>> prepare_pages(), which means all of them should have Private set.
>>>>
>>>> Can you reproduce the bug reliable?
> 
> Yes. almost reliably on my PPC box.
> 
>>>
>>> OK, I got it reproduced.
>>>
>>> It's not a reliable BUG_ON(), but can be reproduced.
>>> The test get skipped for all my boards as it requires fio tool, thus I
>>> didn't get it triggered for all previous runs.
>>>
>>> I'll take a look into the case.
>>
>> This exposed an interesting race window in btrfs_buffered_write():
>>          Writer                    |             fadvice
>> ----------------------------------+-------------------------------
>> btrfs_buffered_write()            |
>> |- prepare_pages()                |
>> |  |- Now all pages involved get  |
>> |     Private set                 |
>> |                                 | btrfs_release_page()
>> |                                 | |- Clear page Private
>> |- lock_extent()                  |
>> |  |- This would prevent          |
>> |     btrfs_release_page() to     |
>> |     clear the page Private      |
>> |
>> |- btrfs_dirty_page()
>>     |- Will trigger the BUG_ON()
> 
> 
> Sorry about the silly query. But help me understand how is above race possible?
> Won't prepare_pages() will lock all the pages first. The same requirement
> of locked page should be with btrfs_releasepage() too no?

A releasepage() caller can easily lock the page itself and then release it.

For call sites like btrfs_invalidatepage(), the page is already locked.

btrfs_releasepage() will not try to release the page if the extent is
locked (i.e. any extent range inside the page has the EXTENT_LOCKED bit).
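
As a rough illustration of that last check, the sketch below tests
whether any locked byte range overlaps a page. It is a userspace model
under stated assumptions: struct locked_range, MODEL_PAGE_SIZE and both
functions are hypothetical, and a flat array stands in for the real
extent io tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size, matching the 64K pages in the report. */
#define MODEL_PAGE_SIZE 65536ULL

/* A locked byte range of the file, both ends inclusive (hypothetical type). */
struct locked_range {
	uint64_t start;
	uint64_t end;
};

/* Returns true if any locked range overlaps the page starting at page_start. */
static bool page_has_locked_range(uint64_t page_start,
				  const struct locked_range *ranges, size_t nr)
{
	uint64_t page_end = page_start + MODEL_PAGE_SIZE - 1;

	for (size_t i = 0; i < nr; i++) {
		if (ranges[i].start <= page_end && ranges[i].end >= page_start)
			return true;
	}
	return false;
}

/* The release decision in this model: only pages with no locked range. */
static bool may_release_page(uint64_t page_start,
			     const struct locked_range *ranges, size_t nr)
{
	return !page_has_locked_range(page_start, ranges, nr);
}
```

With a single 4K range locked inside the first 64K page, the model
refuses to release that page but still allows the next page to be
released, which is the per-page granularity described above.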

> 
> I see only two paths which could result into btrfs_releasepage()
> 1. one via try_to_release_pages -> releasepage()

This is the racing one, called from fadvise() to release pages.

> 2. writeback path calling btrfs_writepage or btrfs_writepages
> 	which may result into calling of btrfs_invalidatepage()

Not this one.

> 
> Although I am not sure which one this is racing with.
> 
>>
>> This only happens for subpage, because subpage introduces new ASSERT()
>> to do extra check.
>>
>> If we want to speak strictly, regular sector size should also report
>> this problem.
>> But regular sector size case doesn't really care about page Private, as
>> it just set page->private to a constant value, unlike subpage case which
>> stores important value.
>>
>> The fix will just re-set page Private and needed structures in
>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>> able to release it anymore.
> 
> With above fix I see a different issue with below signature.
> 
> [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> [  132.147831] BTRFS info (device loop3): has skinny extents
> [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [  132.158228] BTRFS info (device loop3): checking UUID tree
> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b

That looks like some poisoned memory.
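
For reference, a small sketch of how such an address can be decoded. It
assumes the faulting address is a pointer read from memory filled with
the slab free-poison byte 0x6b, plus a member offset; MODEL_POISON_FREE
and both helper names are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed poison byte for freed slab memory. */
#define MODEL_POISON_FREE 0x6bULL

/* Builds the 64-bit word obtained by reading a pointer from poisoned memory. */
static uint64_t poison_word(void)
{
	uint64_t w = 0;

	for (int i = 0; i < 8; i++)
		w = (w << 8) | MODEL_POISON_FREE;
	return w;
}

/*
 * Returns true if addr is "poison word + small offset" and stores that
 * offset in *offset_out. max_offset bounds what counts as a plausible
 * structure member offset.
 */
static bool looks_like_poisoned_pointer(uint64_t addr, uint64_t max_offset,
					uint64_t *offset_out)
{
	uint64_t base = poison_word();

	if (addr < base || addr - base > max_offset)
		return false;
	*offset_out = addr - base;
	return true;
}
```

Under that assumption, 0x6b6b6b6b6b6b725b decodes to the poison word plus
0x6f0, which would be consistent with dereferencing a member of a
structure through a pointer taken from already freed memory.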

I did 128 runs of generic/095 locally on my Arm board while working on
the fix, and could no longer reproduce the crash.

And this call site is even harder to race against: in endio context, the
page keeps PageWriteback set until the last bio in the page has finished.

This means btrfs_releasepage() will not even try to release the page,
while btrfs_invalidatepage() will wait for the page to finish its
writeback before doing anything.
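
The "page stays under writeback until the last bio finishes" rule can be
sketched with a per-sector bitmap. This is a userspace model only, with
hypothetical names; it assumes 4K sectors in a 64K page (16 bits) and
sector-aligned ranges that stay inside the page, and it leaves out the
spinlock that protects the real bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE        65536u
#define MODEL_SECTOR_SIZE      4096u
#define MODEL_SECTORS_PER_PAGE (MODEL_PAGE_SIZE / MODEL_SECTOR_SIZE) /* 16 */

/* Hypothetical per-page state: one writeback bit per sector. */
struct model_subpage {
	uint16_t writeback_bitmap;
	bool page_writeback; /* stands in for the page-level writeback flag */
};

/* Converts a sector-aligned [offset, offset + len) range into bitmap bits. */
static uint16_t range_to_bits(uint32_t offset, uint32_t len)
{
	uint32_t first = offset / MODEL_SECTOR_SIZE;
	uint32_t nbits = len / MODEL_SECTOR_SIZE;

	return (uint16_t)(((1u << nbits) - 1u) << first);
}

/* Marks a range as under writeback; the page flag is set as well. */
static void model_set_writeback(struct model_subpage *sp,
				uint32_t offset, uint32_t len)
{
	sp->writeback_bitmap |= range_to_bits(offset, len);
	sp->page_writeback = true;
}

/*
 * Clears a range when its bio completes. The page-level flag is dropped
 * only once no sector in the page is under writeback any more.
 */
static void model_clear_writeback(struct model_subpage *sp,
				  uint32_t offset, uint32_t len)
{
	sp->writeback_bitmap &= (uint16_t)~range_to_bits(offset, len);
	if (sp->writeback_bitmap == 0)
		sp->page_writeback = false;
}
```

In this model, completing the bio for the first 8K of a page while the
last 4K is still in flight leaves the page-level flag set; only the
second completion clears it.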

So this is very strange to me.

How reproducible is it on your side? Or could something specific to
Power be involved? (IIRC some page flag operations are not atomic,
maybe that is related?)

Thanks,
Qu
> [  133.934432] Faulting instruction address: 0xc000000000283654
> cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
>      pc: c000000000283654: spin_dump+0x70/0xbc
>      lr: c000000000283638: spin_dump+0x54/0xbc
>      sp: c000000007937400
>     msr: 8000000000001033
>     dar: 6b6b6b6b6b6b725b
>    current = 0xc000000007913300
>    paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
>      pid   = 0, comm = swapper/4
> Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
> enter ? for help
> [c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> [c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
> [c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
> [c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
> [c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
> [c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
> [c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
> [c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
> [c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
> [c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
> [c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
> [c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
> [c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
> [c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
> [c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> 
> 
> 
>>
>> The fix is already added to the github branch.
>> Now it has the fix as the HEAD.
>>
>> I hope this won't damage your confidence on the patchset.
>>
>> Thanks for the report!
>> Qu
>>
>>>
>>> Thanks for the report,
>>> Qu
>>>>
>>>> BTW, are using running the latest branch, with this commit at top?
> 
> Yes. Below branch.
> https://github.com/adam900710/linux/commits/subpage
> 
> -ritesh
> 
>>>>
>>>> commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
>>>> Author: Qu Wenruo <wqu@suse.com>
>>>> Date:   Mon Feb 22 14:19:38 2021 +0800
>>>>
>>>>      btrfs: allow read-write for 4K sectorsize on 64K page size systems
>>>>
>>>> As I was updating the patchset until the last minute.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
>>>>> [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
>>>>> [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
>>>>> [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
>>>>> [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
>>>>> [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
>>>>> [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
>>>>> [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
>>>>> [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
>>>>> [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
>>>>> --- Exception: c00 (System Call) at 00007ffff72ef170
>>>>>
>>>>>
>>>>> -ritesh
>>>>>
>

Ritesh Harjani April 16, 2021, 4:52 p.m. UTC | #29

On 21/04/16 02:14PM, Qu Wenruo wrote:
>
>
> On 2021/4/16 1:50 PM, riteshh wrote:
> > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/16 7:34 AM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 7:19 AM, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/4/15 10:52 PM, riteshh wrote:
> > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > >
> > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > failures.
> > > > > > > > And no more random crash during the tests.
> > > > > > >
> > > > > > > Thanks, let me test it on PPC64 box.
> > > > > >
> > > > > > I do see some failures remaining with the patch series.
> > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > I see kernel BUG hitting with below signature.
> > > > >
> > > > > That's pretty different from my tests.
> > > > >
> > > > > As I haven't seen such BUG_ON() for a while.
> > > > >
> > > > >
> > > > > >
> > > > > > Please let me know if this a known failure?
> > > > > >
> > > > > > <xfstests config>
> > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > SECTION       -- btrfs_4k
> > > > > > FSTYP         -- btrfs
> > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > >
> > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > reproduce that.
> > > > >
> > > > > But from the backtrace, it doesn't look like the case,
> > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > >
> > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > >
> > > > > >
> > > > > > <kernel logs>
> > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > size 4096 with page size 65536 is experimental
> > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > in fs/btrfs/subpage.c:171
> > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > >       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > >       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > >       sp: c0000000260d7730
> > > > > >      msr: 800000000282b033
> > > > > >     current = 0xc0000000260c0080
> > > > > >     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > >       pid   = 739712, comm = fio
> > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > enter ? for help
> > > > > > [c0000000260d7790] c000000000a90280
> > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > >
> > > > > This is very strange.
> > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > prepare_pages(), which means all of them should have Private set.
> > > > >
> > > > > Can you reproduce the bug reliable?
> >
> > Yes. almost reliably on my PPC box.
> >
> > > >
> > > > OK, I got it reproduced.
> > > >
> > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > didn't get it triggered for all previous runs.
> > > >
> > > > I'll take a look into the case.
> > >
> > > This exposed an interesting race window in btrfs_buffered_write():
> > >          Writer                    |             fadvice
> > > ----------------------------------+-------------------------------
> > > btrfs_buffered_write()            |
> > > |- prepare_pages()                |
> > > |  |- Now all pages involved get  |
> > > |     Private set                 |
> > > |                                 | btrfs_release_page()
> > > |                                 | |- Clear page Private
> > > |- lock_extent()                  |
> > > |  |- This would prevent          |
> > > |     btrfs_release_page() to     |
> > > |     clear the page Private      |
> > > |
> > > |- btrfs_dirty_page()
> > >     |- Will trigger the BUG_ON()
> >
> >
> > Sorry about the silly query. But help me understand how is above race possible?
> > Won't prepare_pages() will lock all the pages first. The same requirement
> > of locked page should be with btrfs_releasepage() too no?
>
> releasepage() call can easily got a page locked and release it.
>
> For call sites like btrfs_invalidatepage(), the page is already locked.
>
> btrfs_releasepage() will not to try to release the page if the extent is
> locked (any extent range inside the page has EXTENT_LOCK bit).
>
> >
> > I see only two paths which could result into btrfs_releasepage()
> > 1. one via try_to_release_pages -> releasepage()
>
> This is the race one, called from fadvice() to release pages.
>
> > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > 	which may result into calling of btrfs_invalidatepage()
>
> Not this one.
>
> >
> > Although I am not sure which one this is racing with.
> >
> > >
> > > This only happens for subpage, because subpage introduces new ASSERT()
> > > to do extra check.
> > >
> > > If we want to speak strictly, regular sector size should also report
> > > this problem.
> > > But regular sector size case doesn't really care about page Private, as
> > > it just set page->private to a constant value, unlike subpage case which
> > > stores important value.
> > >
> > > The fix will just re-set page Private and needed structures in
> > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > able to release it anymore.
> >
> > With above fix I see a different issue with below signature.
> >
> > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > [  132.147831] BTRFS info (device loop3): has skinny extents
> > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>
> That looks like some poisoned memory.
>
> I have run 128 runs of generic/095 locally on my Arm board during the fix,
> unable to reproduce the crash anymore.
>
> And this call site is even harder to get race, as in endio context, the page
> still has PageWriteback until the last bio finished in the page.
>
> This means btrfs_releasepage() will not even try to release the page, while
> btrfs_invalidatepage() will wait the page to finish its writeback before
> doing anything.
>
> So this is very strange to me.
>
> Any reproducibility on your side? Or something specific to Power is related
> to this case? (IIRC some page flag operation is not atomic, maybe that is
> related?)

I doubt this is Power related. And yes, I can reproduce the issue fairly
easily. For now I will exclude the test from my run to get an overall run with
these patches. Later I will try to debug what is going on.

But if you need any debug logs, do let me know, as it is fairly easily
reproducible.

-ritesh

>
> Thanks,
> Qu
> > [  133.934432] Faulting instruction address: 0xc000000000283654
> > cpu 0x4: Vector: 380 (Data SLB Access) at [c000000007937160]
> >      pc: c000000000283654: spin_dump+0x70/0xbc
> >      lr: c000000000283638: spin_dump+0x54/0xbc
> >      sp: c000000007937400
> >     msr: 8000000000001033
> >     dar: 6b6b6b6b6b6b725b
> >    current = 0xc000000007913300
> >    paca    = 0xc00000003fff9c00   irqmask: 0x03   irq_happened: 0x05
> >      pid   = 0, comm = swapper/4
> > Linux version 5.12.0-rc7-02317-g61d9ec0f765 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #74 SMP Thu Apr 15 23:52:56 CDT 2021
> > enter ? for help
> > [c000000007937470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> > [c0000000079374a0] c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
> > [c0000000079374d0] c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
> > [c000000007937530] c0000000009e0458 end_bio_extent_writepage+0x158/0x270
> > [c0000000079375f0] c000000000b6fd14 bio_endio+0x254/0x270
> > [c000000007937630] c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
> > [c000000007937670] c000000000b6fd14 bio_endio+0x254/0x270
> > [c0000000079376b0] c000000000b781fc blk_update_request+0x46c/0x670
> > [c000000007937760] c000000000b8b394 blk_mq_end_request+0x34/0x1d0
> > [c0000000079377a0] c000000000d82d1c lo_complete_rq+0x11c/0x140
> > [c0000000079377d0] c000000000b880a4 blk_complete_reqs+0x84/0xb0
> > [c000000007937800] c0000000012b2ca4 __do_softirq+0x334/0x680
> > [c000000007937910] c0000000001dd878 irq_exit+0x148/0x1d0
> > [c000000007937940] c000000000016f4c do_IRQ+0x20c/0x240
> > [c0000000079379d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> >
> >
> >
> >
> > >
> > > The fix is already added to the github branch.
> > > Now it has the fix as the HEAD.
> > >
> > > I hope this won't damage your confidence on the patchset.
> > >
> > > Thanks for the report!
> > > Qu
> > >
> > > >
> > > > Thanks for the report,
> > > > Qu
> > > > >
> > > > > BTW, are using running the latest branch, with this commit at top?
> >
> > Yes. Below branch.
> > https://github.com/adam900710/linux/commits/subpage
> >
> > -ritesh
> >
> > > > >
> > > > > commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
> > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > Date:   Mon Feb 22 14:19:38 2021 +0800
> > > > >
> > > > >      btrfs: allow read-write for 4K sectorsize on 64K page sizesystems
> > > > >
> > > > > As I was updating the patchset until the last minute.
> > > > >
> > > > > Thanks,
> > > > > Qu
> > > > >
> > > > > > [c0000000260d7880] c0000000009c7298 btrfs_buffered_write+0x488/0x7f0
> > > > > > [c0000000260d79d0] c0000000009cbeb4 btrfs_file_write_iter+0x314/0x520
> > > > > > [c0000000260d7a50] c00000000055fd84 do_iter_readv_writev+0x1b4/0x260
> > > > > > [c0000000260d7ac0] c00000000056114c do_iter_write+0xdc/0x2c0
> > > > > > [c0000000260d7b10] c0000000005c2d2c iter_file_splice_write+0x2ec/0x510
> > > > > > [c0000000260d7c30] c0000000005c1ba0 do_splice_from+0x50/0x70
> > > > > > [c0000000260d7c50] c0000000005c37e8 do_splice+0x5a8/0x910
> > > > > > [c0000000260d7cd0] c0000000005c3ce0 sys_splice+0x190/0x300
> > > > > > [c0000000260d7d60] c000000000039ba4 system_call_exception+0x384/0x3d0
> > > > > > [c0000000260d7e10] c00000000000d45c system_call_common+0xec/0x278
> > > > > > --- Exception: c00 (System Call) at 00007ffff72ef170
> > > > > >
> > > > > >
> > > > > > -ritesh
> > > > > >
> >
>

Ritesh Harjani April 19, 2021, 5:59 a.m. UTC | #30

On 21/04/16 10:22PM, riteshh wrote:
> On 21/04/16 02:14PM, Qu Wenruo wrote:
> >
> >
> > On 2021/4/16 下午1:50, riteshh wrote:
> > > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 上午7:34, Qu Wenruo wrote:
> > > > >
> > > > >
> > > > > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > > >
> > > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > > failures.
> > > > > > > > > And no more random crash during the tests.
> > > > > > > >
> > > > > > > > Thanks, let me test it on PPC64 box.
> > > > > > >
> > > > > > > I do see some failures remaining with the patch series.
> > > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > > I see kernel BUG hitting with below signature.
> > > > > >
> > > > > > That's pretty different from my tests.
> > > > > >
> > > > > > As I haven't seen such BUG_ON() for a while.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Please let me know if this a known failure?
> > > > > > >
> > > > > > > <xfstests config>
> > > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > > SECTION       -- btrfs_4k
> > > > > > > FSTYP         -- btrfs
> > > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > > >
> > > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > > reproduce that.
> > > > > >
> > > > > > But from the backtrace, it doesn't look like the case,
> > > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > > >
> > > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > > >
> > > > > > >
> > > > > > > <kernel logs>
> > > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > > in fs/btrfs/subpage.c:171
> > > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > > >       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > > >       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > > >       sp: c0000000260d7730
> > > > > > >      msr: 800000000282b033
> > > > > > >     current = 0xc0000000260c0080
> > > > > > >     paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > > >       pid   = 739712, comm = fio
> > > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > enter ? for help
> > > > > > > [c0000000260d7790] c000000000a90280
> > > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > > >
> > > > > > This is very strange.
> > > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > > prepare_pages(), which means all of them should have Private set.
> > > > > >
> > > > > > Can you reproduce the bug reliable?
> > >
> > > Yes. almost reliably on my PPC box.
> > >
> > > > >
> > > > > OK, I got it reproduced.
> > > > >
> > > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > > didn't get it triggered for all previous runs.
> > > > >
> > > > > I'll take a look into the case.
> > > >
> > > > This exposed an interesting race window in btrfs_buffered_write():
> > > >          Writer                    |             fadvice
> > > > ----------------------------------+-------------------------------
> > > > btrfs_buffered_write()            |
> > > > |- prepare_pages()                |
> > > > |  |- Now all pages involved get  |
> > > > |     Private set                 |
> > > > |                                 | btrfs_release_page()
> > > > |                                 | |- Clear page Private
> > > > |- lock_extent()                  |
> > > > |  |- This would prevent          |
> > > > |     btrfs_release_page() to     |
> > > > |     clear the page Private      |
> > > > |
> > > > |- btrfs_dirty_page()
> > > >     |- Will trigger the BUG_ON()
> > >
> > >
> > > Sorry about the silly query. But help me understand how is above race possible?
> > > Won't prepare_pages() will lock all the pages first. The same requirement
> > > of locked page should be with btrfs_releasepage() too no?
> >
> > releasepage() call can easily got a page locked and release it.
> >
> > For call sites like btrfs_invalidatepage(), the page is already locked.
> >
> > btrfs_releasepage() will not to try to release the page if the extent is
> > locked (any extent range inside the page has EXTENT_LOCK bit).
> >
> > >
> > > I see only two paths which could result into btrfs_releasepage()
> > > 1. one via try_to_release_pages -> releasepage()
> >
> > This is the race one, called from fadvice() to release pages.
> >
> > > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > > 	which may result into calling of btrfs_invalidatepage()
> >
> > Not this one.
> >
> > >
> > > Although I am not sure which one this is racing with.
> > >
> > > >
> > > > This only happens for subpage, because subpage introduces new ASSERT()
> > > > to do extra check.
> > > >
> > > > If we want to speak strictly, regular sector size should also report
> > > > this problem.
> > > > But regular sector size case doesn't really care about page Private, as
> > > > it just set page->private to a constant value, unlike subpage case which
> > > > stores important value.
> > > >
> > > > The fix will just re-set page Private and needed structures in
> > > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > > able to release it anymore.
> > >
> > > With above fix I see a different issue with below signature.
> > >
> > > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > > [  132.147831] BTRFS info (device loop3): has skinny extents
> > > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> >
> > That looks like some poisoned memory.
> >
> > I have run 128 runs of generic/095 locally on my Arm board during the fix,
> > unable to reproduce the crash anymore.
> >
> > And this call site is even harder to get race, as in endio context, the page
> > still has PageWriteback until the last bio finished in the page.
> >
> > This means btrfs_releasepage() will not even try to release the page, while
> > btrfs_invalidatepage() will wait the page to finish its writeback before
> > doing anything.
> >
> > So this is very strange to me.
> >
> > Any reproducibility on your side? Or something specific to Power is related
> > to this case? (IIRC some page flag operation is not atomic, maybe that is
> > related?)
>
> I doubt if this is Power related. And yes, I can reproduce the issue fairly
> easily. For now I will exclude the test from my run to get a overall run with

Here are some other failures that I noticed during testing on Power.
Thanks for looking into this.

1. tests/btrfs/052
btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
    --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
    @@ -91,553 +91,5 @@
     23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
     *
     30
    -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    -*
    -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    -*
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)

^^^ This could also be due to the error below, found in 052.full:
	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
	total 1 failures
	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'

2. tests/btrfs/076 => looks like a genuine failure.
btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
    --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
    @@ -1,3 +1,3 @@
     QA output created by 076
    -80
    -80
    +1
    +1
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)

3. tests/btrfs/106 => looks like a genuine failure.
btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
    --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
    +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
    @@ -5,19 +5,19 @@
     File contents before unmount:
     0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
     *
    -40
    +1000
     File contents after remount:
     0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)

> these patches. Later will try and debug what is going on.
>
> But if you need any debug logs - do let me know, as it is fairly easily
> reproducible.

For tests/generic/095, can you please retry reproducing the issue (with your
latest patch) on your setup with the configs below enabled?
1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
   CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
   CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING


-ritesh

Qu Wenruo April 19, 2021, 6:16 a.m. UTC | #31

On 2021/4/19 下午1:59, riteshh wrote:
> On 21/04/16 10:22PM, riteshh wrote:
>> On 21/04/16 02:14PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/4/16 下午1:50, riteshh wrote:
>>>> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/4/16 上午7:34, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/4/15 下午10:52, riteshh wrote:
>>>>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>>>>> Good news, you can fetch the subpage branch for better test results.
>>>>>>>>>>
>>>>>>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>>>>>>> failures.
>>>>>>>>>> And no more random crash during the tests.
>>>>>>>>>
>>>>>>>>> Thanks, let me test it on PPC64 box.
>>>>>>>>
>>>>>>>> I do see some failures remaining with the patch series.
>>>>>>>> However the one which is blocking my testing is the tests/generic/095
>>>>>>>> I see kernel BUG hitting with below signature.
>>>>>>>
>>>>>>> That's pretty different from my tests.
>>>>>>>
>>>>>>> As I haven't seen such BUG_ON() for a while.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know if this a known failure?
>>>>>>>>
>>>>>>>> <xfstests config>
>>>>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>>>>> SECTION       -- btrfs_4k
>>>>>>>> FSTYP         -- btrfs
>>>>>>>> PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>>>>
>>>>>>> I see you're using -n 4096, not the default -n 16K, let me see if I can
>>>>>>> reproduce that.
>>>>>>>
>>>>>>> But from the backtrace, it doesn't look like the case,
>>>>>>> as it happens for data path, which means it's only related to sectorsize.
>>>>>>>
>>>>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>>>>
>>>>>>>>
>>>>>>>> <kernel logs>
>>>>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
>>>>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
>>>>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
>>>>>>>> in fs/btrfs/subpage.c:171
>>>>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>>>>        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>>>>        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>>>>        sp: c0000000260d7730
>>>>>>>>       msr: 800000000282b033
>>>>>>>>      current = 0xc0000000260c0080
>>>>>>>>      paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
>>>>>>>>        pid   = 739712, comm = fio
>>>>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
>>>>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>> enter ? for help
>>>>>>>> [c0000000260d7790] c000000000a90280
>>>>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>>>>> [c0000000260d77b0] c000000000a91064
>>>>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>>>>
>>>>>>> This is very strange.
>>>>>>> As in btrfs_dirty_pages(), the pages passed in are already prepared by
>>>>>>> prepare_pages(), which means all of them should have Private set.
>>>>>>>
>>>>>>> Can you reproduce the bug reliable?
>>>>
>>>> Yes. almost reliably on my PPC box.
>>>>
>>>>>>
>>>>>> OK, I got it reproduced.
>>>>>>
>>>>>> It's not a reliable BUG_ON(), but can be reproduced.
>>>>>> The test get skipped for all my boards as it requires fio tool, thus I
>>>>>> didn't get it triggered for all previous runs.
>>>>>>
>>>>>> I'll take a look into the case.
>>>>>
>>>>> This exposed an interesting race window in btrfs_buffered_write():
>>>>>           Writer                    |             fadvice
>>>>> ----------------------------------+-------------------------------
>>>>> btrfs_buffered_write()            |
>>>>> |- prepare_pages()                |
>>>>> |  |- Now all pages involved get  |
>>>>> |     Private set                 |
>>>>> |                                 | btrfs_release_page()
>>>>> |                                 | |- Clear page Private
>>>>> |- lock_extent()                  |
>>>>> |  |- This would prevent          |
>>>>> |     btrfs_release_page() to     |
>>>>> |     clear the page Private      |
>>>>> |
>>>>> |- btrfs_dirty_page()
>>>>>      |- Will trigger the BUG_ON()
>>>>
>>>>
>>>> Sorry about the silly query. But help me understand how is above race possible?
>>>> Won't prepare_pages() will lock all the pages first. The same requirement
>>>> of locked page should be with btrfs_releasepage() too no?
>>>
>>> releasepage() call can easily got a page locked and release it.
>>>
>>> For call sites like btrfs_invalidatepage(), the page is already locked.
>>>
>>> btrfs_releasepage() will not to try to release the page if the extent is
>>> locked (any extent range inside the page has EXTENT_LOCK bit).
>>>
>>>>
>>>> I see only two paths which could result into btrfs_releasepage()
>>>> 1. one via try_to_release_pages -> releasepage()
>>>
>>> This is the race one, called from fadvice() to release pages.
>>>
>>>> 2. writeback path calling btrfs_writepage or btrfs_writepages
>>>> 	which may result into calling of btrfs_invalidatepage()
>>>
>>> Not this one.
>>>
>>>>
>>>> Although I am not sure which one this is racing with.
>>>>
>>>>>
>>>>> This only happens for subpage, because subpage introduces new ASSERT()
>>>>> to do extra check.
>>>>>
>>>>> If we want to speak strictly, regular sector size should also report
>>>>> this problem.
>>>>> But regular sector size case doesn't really care about page Private, as
>>>>> it just set page->private to a constant value, unlike subpage case which
>>>>> stores important value.
>>>>>
>>>>> The fix will just re-set page Private and needed structures in
>>>>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>>>>> able to release it anymore.
>>>>
>>>> With above fix I see a different issue with below signature.
>>>>
>>>> [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
>>>> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
>>>> [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
>>>> [  132.146892] BTRFS info (device loop3): disk space caching is enabled
>>>> [  132.147831] BTRFS info (device loop3): has skinny extents
>>>> [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
>>>> [  132.158228] BTRFS info (device loop3): checking UUID tree
>>>> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
>>>> [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
>>>
>>> That looks like some poisoned memory.
>>>
>>> I have run 128 runs of generic/095 locally on my Arm board during the fix,
>>> unable to reproduce the crash anymore.
>>>
>>> And this call site is even harder to get race, as in endio context, the page
>>> still has PageWriteback until the last bio finished in the page.
>>>
>>> This means btrfs_releasepage() will not even try to release the page, while
>>> btrfs_invalidatepage() will wait the page to finish its writeback before
>>> doing anything.
>>>
>>> So this is very strange to me.
>>>
>>> Any reproducibility on your side? Or something specific to Power is related
>>> to this case? (IIRC some page flag operation is not atomic, maybe that is
>>> related?)
>>
>> I doubt if this is Power related. And yes, I can reproduce the issue fairly
>> easily. For now I will exclude the test from my run to get a overall run with
>
> Here, are some other failures that I noticed during testing on Power.
> Thanks for looking into this.

Thank you very much for the extra test!

>
> 1. tests/btrfs/052
> btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
>      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
>      @@ -91,553 +91,5 @@
>       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
>       *
>       30
>      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>      -*
>      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>      -*
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)
>
> ^^^ this could also be due to below error found in 052.full
> 	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
> 	total 1 failures
> 	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'
>
> 2. tests/btrfs/076 => looks a genuine failure.
> btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
>      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
>      @@ -1,3 +1,3 @@
>       QA output created by 076
>      -80
>      -80
>      +1
>      +1
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)

This one is really compression related. Since I hardcoded compression to be
disabled, the ratio will always be 1.

>
> 3. tests/btrfs/106  => looks a genuine failure.
> btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
>      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
>      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
>      @@ -5,19 +5,19 @@
>       File contents before unmount:
>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>       *
>      -40
>      +1000
>       File contents after remount:
>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>      ...
>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)

That's a similar problem: the test needs compression, while compression is
hardcoded to be disabled, thus clone reports a different value.

>
>> these patches. Later will try and debug what is going on.
>>
>> But if you need any debug logs - do let me know, as it is fairly easily
>> reproducible.
>
> For tests/generic/095 can you pls retry reproducing the issue (with your latest
> patch) on your setup with below configs enabled?
> 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
>     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
>     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING

Thanks, I'll retry using the extra debugging options.

But I now have a more solid explanation of why the bug happens.

You're right, prepare_pages() should have the page locked by calling
find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
just release the page.

But there is a small window in prepare_uptodate_page(), where we may
call btrfs_readpage(), which will unlock the page.

So the page stays unlocked until we re-lock it in
prepare_uptodate_page(), and btrfs_releasepage() can release it in between.

That's how we get a page with its Private bit cleared.

I'm trying a better fix like the following diff.
But I'm not yet 100% confident that the PagePrivate() check is enough,
thus I'll do more tests before sending the proper fix.

Thanks,
Qu

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 45ec3f5ef839..49f78d643392 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
                         unlock_page(page);
                         return -EIO;
                 }
-               if (page->mapping != inode->i_mapping) {
+
+               /*
+                * Since btrfs_readpage() will get the page unlocked, we have
+                * a window where fadvise() can try to release the page.
+                * Here we check both inode mapping and PagePrivate() to
+                * make sure the page is not released.
+                *
+                * The private flag check is essential for subpage as we need
+                * to store extra bitmap using page->private.
+                */
+               if (page->mapping != inode->i_mapping || PagePrivate(page)) {
                         unlock_page(page);
                         return -EAGAIN;
                 }


>
>
> -ritesh
>

Ritesh Harjani April 19, 2021, 7:04 a.m. UTC | #32

On 21/04/19 02:16PM, Qu Wenruo wrote:
>
>
> On 2021/4/19 下午1:59, riteshh wrote:
> > On 21/04/16 10:22PM, riteshh wrote:
> > > On 21/04/16 02:14PM, Qu Wenruo wrote:
> > > >
> > > >
> > > > On 2021/4/16 下午1:50, riteshh wrote:
> > > > > On 21/04/16 09:34AM, Qu Wenruo wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/4/16 上午7:34, Qu Wenruo wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2021/4/16 上午7:19, Qu Wenruo wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2021/4/15 下午10:52, riteshh wrote:
> > > > > > > > > On 21/04/15 09:14AM, riteshh wrote:
> > > > > > > > > > On 21/04/12 07:33PM, Qu Wenruo wrote:
> > > > > > > > > > > Good news, you can fetch the subpage branch for better test results.
> > > > > > > > > > >
> > > > > > > > > > > Now the branch should pass all generic tests, except defrag and known
> > > > > > > > > > > failures.
> > > > > > > > > > > And no more random crash during the tests.
> > > > > > > > > >
> > > > > > > > > > Thanks, let me test it on PPC64 box.
> > > > > > > > >
> > > > > > > > > I do see some failures remaining with the patch series.
> > > > > > > > > However the one which is blocking my testing is the tests/generic/095
> > > > > > > > > I see kernel BUG hitting with below signature.
> > > > > > > >
> > > > > > > > That's pretty different from my tests.
> > > > > > > >
> > > > > > > > As I haven't seen such BUG_ON() for a while.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Please let me know if this a known failure?
> > > > > > > > >
> > > > > > > > > <xfstests config>
> > > > > > > > > #:~/work-tools/xfstests$ sudo ./check -g auto
> > > > > > > > > SECTION       -- btrfs_4k
> > > > > > > > > FSTYP         -- btrfs
> > > > > > > > > PLATFORM      -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
> > > > > > > > > SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > > > MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
> > > > > > > >
> > > > > > > > I see you're using -n 4096, not the default -n 16K, let me see if I can
> > > > > > > > reproduce that.
> > > > > > > >
> > > > > > > > But from the backtrace, it doesn't look like the case,
> > > > > > > > as it happens for data path, which means it's only related to sectorsize.
> > > > > > > >
> > > > > > > > > MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <kernel logs>
> > > > > > > > > [ 6057.560580] BTRFS warning (device loop3): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
> > > > > > > > > [ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
> > > > > > > > > [ 6058.348910] BTRFS info (device loop2): has skinny extents
> > > > > > > > > [ 6058.351930] BTRFS warning (device loop2): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
> > > > > > > > > devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
> > > > > > > > > [ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
> > > > > > > > > [ 6060.226213] BTRFS info (device loop3): has skinny extents
> > > > > > > > > [ 6060.227084] BTRFS warning (device loop3): read-write for sector
> > > > > > > > > size 4096 with page size 65536 is experimental
> > > > > > > > > [ 6060.234537] BTRFS info (device loop3): checking UUID tree
> > > > > > > > > [ 6061.375902] assertion failed: PagePrivate(page) && page->private,
> > > > > > > > > in fs/btrfs/subpage.c:171
> > > > > > > > > [ 6061.378296] ------------[ cut here ]------------
> > > > > > > > > [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > > > cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
> > > > > > > > >        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
> > > > > > > > >        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
> > > > > > > > >        sp: c0000000260d7730
> > > > > > > > >       msr: 800000000282b033
> > > > > > > > >      current = 0xc0000000260c0080
> > > > > > > > >      paca    = 0xc00000003fff8a00   irqmask: 0x03   irq_happened: 0x01
> > > > > > > > >        pid   = 739712, comm = fio
> > > > > > > > > kernel BUG at fs/btrfs/ctree.h:3403!
> > > > > > > > > Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
> > > > > > > > > (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
> > > > > > > > > 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
> > > > > > > > > enter ? for help
> > > > > > > > > [c0000000260d7790] c000000000a90280
> > > > > > > > > btrfs_subpage_assert.isra.9+0x70/0x110
> > > > > > > > > [c0000000260d77b0] c000000000a91064
> > > > > > > > > btrfs_subpage_set_uptodate+0x54/0x110
> > > > > > > > > [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
> > > > > > > >
> > > > > > > > This is very strange.
> > > > > > > > As in btrfs_dirty_pages(), the pages passed in are already prepared by
> > > > > > > > prepare_pages(), which means all of them should have Private set.
> > > > > > > >
> > > > > > > > Can you reproduce the bug reliable?
> > > > >
> > > > > Yes. almost reliably on my PPC box.
> > > > >
> > > > > > >
> > > > > > > OK, I got it reproduced.
> > > > > > >
> > > > > > > It's not a reliable BUG_ON(), but can be reproduced.
> > > > > > > The test get skipped for all my boards as it requires fio tool, thus I
> > > > > > > didn't get it triggered for all previous runs.
> > > > > > >
> > > > > > > I'll take a look into the case.
> > > > > >
> > > > > > This exposed an interesting race window in btrfs_buffered_write():
> > > > > >           Writer                    |             fadvice
> > > > > > ----------------------------------+-------------------------------
> > > > > > btrfs_buffered_write()            |
> > > > > > |- prepare_pages()                |
> > > > > > |  |- Now all pages involved get  |
> > > > > > |     Private set                 |
> > > > > > |                                 | btrfs_release_page()
> > > > > > |                                 | |- Clear page Private
> > > > > > |- lock_extent()                  |
> > > > > > |  |- This would prevent          |
> > > > > > |     btrfs_release_page() to     |
> > > > > > |     clear the page Private      |
> > > > > > |
> > > > > > |- btrfs_dirty_page()
> > > > > >      |- Will trigger the BUG_ON()
> > > > >
> > > > >
> > > > > Sorry about the silly query. But help me understand how is above race possible?
> > > > > Won't prepare_pages() will lock all the pages first. The same requirement
> > > > > of locked page should be with btrfs_releasepage() too no?
> > > >
> > > > releasepage() call can easily got a page locked and release it.
> > > >
> > > > For call sites like btrfs_invalidatepage(), the page is already locked.
> > > >
> > > > btrfs_releasepage() will not to try to release the page if the extent is
> > > > locked (any extent range inside the page has EXTENT_LOCK bit).
> > > >
> > > > >
> > > > > I see only two paths which could result into btrfs_releasepage()
> > > > > 1. one via try_to_release_pages -> releasepage()
> > > >
> > > > This is the race one, called from fadvice() to release pages.
> > > >
> > > > > 2. writeback path calling btrfs_writepage or btrfs_writepages
> > > > > 	which may result into calling of btrfs_invalidatepage()
> > > >
> > > > Not this one.
> > > >
> > > > >
> > > > > Although I am not sure which one this is racing with.
> > > > >
> > > > > >
> > > > > > This only happens for subpage, because subpage introduces new ASSERT()
> > > > > > to do extra check.
> > > > > >
> > > > > > If we want to speak strictly, regular sector size should also report
> > > > > > this problem.
> > > > > > But regular sector size case doesn't really care about page Private, as
> > > > > > it just set page->private to a constant value, unlike subpage case which
> > > > > > stores important value.
> > > > > >
> > > > > > The fix will just re-set page Private and needed structures in
> > > > > > btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
> > > > > > able to release it anymore.
> > > > >
> > > > > With above fix I see a different issue with below signature.
> > > > >
> > > > > [  130.272410] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
> > > > > [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
> > > > > [  132.042532] BTRFS: device fsid 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (5226)
> > > > > [  132.146892] BTRFS info (device loop3): disk space caching is enabled
> > > > > [  132.147831] BTRFS info (device loop3): has skinny extents
> > > > > [  132.148491] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> > > > > [  132.158228] BTRFS info (device loop3): checking UUID tree
> > > > > [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
> > > > > [  133.932874] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> > > >
> > > > That looks like some poisoned memory.
> > > >
> > > > I have run 128 runs of generic/095 locally on my Arm board during the fix,
> > > > unable to reproduce the crash anymore.
> > > >
> > > > And this call site is even harder to get race, as in endio context, the page
> > > > still has PageWriteback until the last bio finished in the page.
> > > >
> > > > This means btrfs_releasepage() will not even try to release the page, while
> > > > btrfs_invalidatepage() will wait the page to finish its writeback before
> > > > doing anything.
> > > >
> > > > So this is very strange to me.
> > > >
> > > > Any reproducibility on your side? Or something specific to Power is related
> > > > to this case? (IIRC some page flag operation is not atomic, maybe that is
> > > > related?)
> > >
> > > I doubt if this is Power related. And yes, I can reproduce the issue fairly
> > > easily. For now I will exclude the test from my run to get a overall run with
> >
> > Here, are some other failures that I noticed during testing on Power.
> > Thanks for looking into this.
>
> Thank you very much for the extra test!
>
> >
> > 1. tests/btrfs/052
> > btrfs/052       [failed, exit status 1]- output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
> >      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 17:18:17.762928432 +0000
> >      @@ -91,553 +91,5 @@
> >       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
> >       *
> >       30
> >      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >      -*
> >      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
> >      -*
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  to see the entire diff)
> >
> > ^^^ this could also be due to below error found in 052.full
> > 	ERROR: defrag range ioctl not supported in this kernel version, 2.6.33 and newer is required
> > 	total 1 failures
> > 	failed: '/usr/local/bin/btrfs filesystem defragment /mnt1/scratch/foo'
> >
> > 2. tests/btrfs/076 => looks a genuine failure.
> > btrfs/076       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
> >      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 17:19:33.344981383 +0000
> >      @@ -1,3 +1,3 @@
> >       QA output created by 076
> >      -80
> >      -80
> >      +1
> >      +1
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  to see the entire diff)
>
> This is really a compression related one. Since I hardcoded to disable
> compression, the ratio is always be 1.

Ok, thanks.

>
> >
> > 3. tests/btrfs/106  => looks a genuine failure.
> > btrfs/106       - output mismatch (see /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
> >      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
> >      +++ /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 17:49:27.296128823 +0000
> >      @@ -5,19 +5,19 @@
> >       File contents before unmount:
> >       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >       *
> >      -40
> >      +1000
> >       File contents after remount:
> >       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >      ...
> >      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  to see the entire diff)
>
> That's a similar problem, compression needed
>  while compression is hard coded to be disable, thus clone reports
> different value.

Ok, then maybe we need a way to tell fstests that compression is
disabled, or find some way so that these tests don't show up as failures.

>
> >
> > > these patches. Later will try and debug what is going on.
> > >
> > > But if you need any debug logs - do let me know, as it is fairly easily
> > > reproducible.
> >
> > For tests/generic/095 can you pls retry reproducing the issue (with your latest
> > patch) on your setup with below configs enabled?
> > 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
> >     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, CONFIG_DEBUG_STACKOVERFLOW,
> >     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING
>
> Thanks, I'll retry using the extra debugging options.
>
> But I have a more solid explanation on why the bug happens now.
>
> You're right, prepare_pages() should have the page locked by calling
> find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
> just release the page.
>
> But there is a small window in prepare_uptodate_page(), where we may
> call btrfs_readpage(), which will unlock the page.
>
> So there is a window where we have page unlocked, before we re-lock it
> in prepare_uptodate_page().
>
> By that, we got a page with its Private bit cleared.

Thanks for the explanation.

>
> I'm trying a better fix like the following diff.
> But I'm not yet 100% confident if the PagePrivate() check is enough,
> thus I'll do more test before sending the proper fix.

Sure, that will be helpful. Once you have the fix, I can help with the testing
on my machine.

>
> Thanks,
> Qu
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 45ec3f5ef839..49f78d643392 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we have
> +                * a window where fadvise() can try to release the page.
> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The private flag check is essential for subpage as we need
> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping || PagePrivate(page)) {
>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
>

Yes, I was looking into the code path to see if there is any chance where we may
release the page lock, and I think I may have seen this, but I was not sure
whether it would hit in our case. Thanks for the explanation.

I would now like to review your patch series. Although I am not that familiar with
btrfs internals, I will do my best to review it and will also ask any questions
about the patch series and/or the bs < ps functionality in btrfs. :)

-ritesh
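
The race discussed above (a buffered writer against page release driven by
fadvise) can be approximated from user space. The following is only a rough
sketch, not the generic/095 test itself and not a confirmed reproducer: the
helper name stress_write_vs_fadvise(), the 4K write size, and the sync
interval are all assumptions made for illustration. It relies on
POSIX_FADV_DONTNEED asking the kernel to drop clean page cache pages, so the
writer syncs periodically to leave clean pages behind.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of fstests or btrfs): the parent issues
 * 4K buffered writes while a forked child repeatedly asks the kernel to
 * drop cached pages of the same file. Returns 0 on success, -1 on error.
 */
static int stress_write_vs_fadvise(const char *path, int rounds)
{
	char buf[4096];
	int status;
	int ret = 0;
	int fd;
	pid_t pid;

	assert(rounds > 0);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	memset(buf, 0xaa, sizeof(buf));

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		/* Child: keep requesting release of the whole file range. */
		for (int i = 0; i < rounds; i++)
			posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		_exit(0);
	}

	/* Parent: sub-page sized writes, so a 64K page is only partly covered. */
	for (int i = 0; i < rounds; i++) {
		off_t off = (off_t)(i % 16) * (off_t)sizeof(buf);

		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
			ret = -1;
			break;
		}
		/* Only clean pages can be dropped, so sync now and then. */
		if (i % 8 == 7 && fdatasync(fd) < 0) {
			ret = -1;
			break;
		}
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0)
		ret = -1;
	close(fd);
	return ret;
}
```

Calling stress_write_vs_fadvise() with a path on the scratch mount and a
large round count would be the intended use; whether it actually hits the
window depends on timing and on the page not being uptodate when the write
is prepared.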

Qu Wenruo April 19, 2021, 7:19 a.m. UTC | #33

On 2021/4/19 下午2:16, Qu Wenruo wrote:
> 
> 
> On 2021/4/19 下午1:59, riteshh wrote:
>> On 21/04/16 10:22PM, riteshh wrote:
>>> On 21/04/16 02:14PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/4/16 下午1:50, riteshh wrote:
>>>>> On 21/04/16 09:34AM, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2021/4/16 上午7:34, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2021/4/16 上午7:19, Qu Wenruo wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2021/4/15 下午10:52, riteshh wrote:
>>>>>>>>> On 21/04/15 09:14AM, riteshh wrote:
>>>>>>>>>> On 21/04/12 07:33PM, Qu Wenruo wrote:
>>>>>>>>>>> Good news, you can fetch the subpage branch for better test 
>>>>>>>>>>> results.
>>>>>>>>>>>
>>>>>>>>>>> Now the branch should pass all generic tests, except defrag and known
>>>>>>>>>>> failures.
>>>>>>>>>>> And no more random crash during the tests.
>>>>>>>>>>
>>>>>>>>>> Thanks, let me test it on PPC64 box.
>>>>>>>>>
>>>>>>>>> I do see some failures remaining with the patch series.
>>>>>>>>> However the one which is blocking my testing is the 
>>>>>>>>> tests/generic/095
>>>>>>>>> I see kernel BUG hitting with below signature.
>>>>>>>>
>>>>>>>> That's pretty different from my tests.
>>>>>>>>
>>>>>>>> As I haven't seen such BUG_ON() for a while.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know if this a known failure?
>>>>>>>>>
>>>>>>>>> <xfstests config>
>>>>>>>>> #:~/work-tools/xfstests$ sudo ./check -g auto
>>>>>>>>> SECTION       -- btrfs_4k
>>>>>>>>> FSTYP         -- btrfs
>>>>>>>>> PLATFORM      -- Linux/ppc64le qemu 
>>>>>>>>> 5.12.0-rc7-02316-g3490dae50c0 #73
>>>>>>>>> SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>>> MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
>>>>>>>>
>>>>>>>> I see you're using -n 4096, not the default -n 16K, let me see 
>>>>>>>> if I can
>>>>>>>> reproduce that.
>>>>>>>>
>>>>>>>> But from the backtrace, it doesn't look like the case,
>>>>>>>> as it happens for data path, which means it's only related to 
>>>>>>>> sectorsize.
>>>>>>>>
>>>>>>>>> MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <kernel logs>
>>>>>>>>> [ 6057.560580] BTRFS warning (device loop3): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
>>>>>>>>> [ 6058.345127] BTRFS info (device loop2): disk space caching is 
>>>>>>>>> enabled
>>>>>>>>> [ 6058.348910] BTRFS info (device loop2): has skinny extents
>>>>>>>>> [ 6058.351930] BTRFS warning (device loop2): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6059.896382] BTRFS: device fsid 
>>>>>>>>> 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
>>>>>>>>> devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
>>>>>>>>> [ 6060.225107] BTRFS info (device loop3): disk space caching is 
>>>>>>>>> enabled
>>>>>>>>> [ 6060.226213] BTRFS info (device loop3): has skinny extents
>>>>>>>>> [ 6060.227084] BTRFS warning (device loop3): read-write for sector
>>>>>>>>> size 4096 with page size 65536 is experimental
>>>>>>>>> [ 6060.234537] BTRFS info (device loop3): checking UUID tree
>>>>>>>>> [ 6061.375902] assertion failed: PagePrivate(page) && 
>>>>>>>>> page->private,
>>>>>>>>> in fs/btrfs/subpage.c:171
>>>>>>>>> [ 6061.378296] ------------[ cut here ]------------
>>>>>>>>> [ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>>> cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
>>>>>>>>>        pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
>>>>>>>>>        lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
>>>>>>>>>        sp: c0000000260d7730
>>>>>>>>>       msr: 800000000282b033
>>>>>>>>>      current = 0xc0000000260c0080
>>>>>>>>>      paca    = 0xc00000003fff8a00   irqmask: 0x03   
>>>>>>>>> irq_happened: 0x01
>>>>>>>>>        pid   = 739712, comm = fio
>>>>>>>>> kernel BUG at fs/btrfs/ctree.h:3403!
>>>>>>>>> Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
>>>>>>>>> (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for 
>>>>>>>>> Ubuntu)
>>>>>>>>> 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
>>>>>>>>> enter ? for help
>>>>>>>>> [c0000000260d7790] c000000000a90280
>>>>>>>>> btrfs_subpage_assert.isra.9+0x70/0x110
>>>>>>>>> [c0000000260d77b0] c000000000a91064
>>>>>>>>> btrfs_subpage_set_uptodate+0x54/0x110
>>>>>>>>> [c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
>>>>>>>>
>>>>>>>> This is very strange.
>>>>>>>> As in btrfs_dirty_pages(), the pages passed in are already 
>>>>>>>> prepared by
>>>>>>>> prepare_pages(), which means all of them should have Private set.
>>>>>>>>
>>>>>>>> Can you reproduce the bug reliable?
>>>>>
>>>>> Yes. almost reliably on my PPC box.
>>>>>
>>>>>>>
>>>>>>> OK, I got it reproduced.
>>>>>>>
>>>>>>> It's not a reliable BUG_ON(), but can be reproduced.
>>>>>>> The test get skipped for all my boards as it requires fio tool, 
>>>>>>> thus I
>>>>>>> didn't get it triggered for all previous runs.
>>>>>>>
>>>>>>> I'll take a look into the case.
>>>>>>
>>>>>> This exposed an interesting race window in btrfs_buffered_write():
>>>>>>           Writer                    |             fadvice
>>>>>> ----------------------------------+-------------------------------
>>>>>> btrfs_buffered_write()            |
>>>>>> |- prepare_pages()                |
>>>>>> |  |- Now all pages involved get  |
>>>>>> |     Private set                 |
>>>>>> |                                 | btrfs_release_page()
>>>>>> |                                 | |- Clear page Private
>>>>>> |- lock_extent()                  |
>>>>>> |  |- This would prevent          |
>>>>>> |     btrfs_release_page() to     |
>>>>>> |     clear the page Private      |
>>>>>> |
>>>>>> |- btrfs_dirty_page()
>>>>>>      |- Will trigger the BUG_ON()
>>>>>
>>>>>
>>>>> Sorry about the silly query. But help me understand how is above race possible?
>>>>> Won't prepare_pages() will lock all the pages first. The same requirement
>>>>> of locked page should be with btrfs_releasepage() too no?
>>>>
>>>> releasepage() call can easily got a page locked and release it.
>>>>
>>>> For call sites like btrfs_invalidatepage(), the page is already locked.
>>>>
>>>> btrfs_releasepage() will not to try to release the page if the 
>>>> extent is
>>>> locked (any extent range inside the page has EXTENT_LOCK bit).
>>>>
>>>>>
>>>>> I see only two paths which could result into btrfs_releasepage()
>>>>> 1. one via try_to_release_pages -> releasepage()
>>>>
>>>> This is the race one, called from fadvice() to release pages.
>>>>
>>>>> 2. writeback path calling btrfs_writepage or btrfs_writepages
>>>>>     which may result into calling of btrfs_invalidatepage()
>>>>
>>>> Not this one.
>>>>
>>>>>
>>>>> Although I am not sure which one this is racing with.
>>>>>
>>>>>>
>>>>>> This only happens for subpage, because subpage introduces new 
>>>>>> ASSERT()
>>>>>> to do extra check.
>>>>>>
>>>>>> If we want to speak strictly, regular sector size should also report
>>>>>> this problem.
>>>>>> But regular sector size case doesn't really care about page Private, as
>>>>>> it just set page->private to a constant value, unlike subpage case which
>>>>>> stores important value.
>>>>>>
>>>>>> The fix will just re-set page Private and needed structures in
>>>>>> btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
>>>>>> able to release it anymore.
>>>>>
>>>>> With above fix I see a different issue with below signature.
>>>>>
>>>>> [  130.272410] BTRFS warning (device loop2): read-write for sector 
>>>>> size 4096 with page size 65536 is experimental
>>>>> [  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
>>>>> [  132.042532] BTRFS: device fsid 
>>>>> 642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3 
>>>>> scanned by mkfs.btrfs (5226)
>>>>> [  132.146892] BTRFS info (device loop3): disk space caching is 
>>>>> enabled
>>>>> [  132.147831] BTRFS info (device loop3): has skinny extents
>>>>> [  132.148491] BTRFS warning (device loop3): read-write for sector 
>>>>> size 4096 with page size 65536 is experimental
>>>>> [  132.158228] BTRFS info (device loop3): checking UUID tree
>>>>> [  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
>>>>> [  133.932874] BUG: Unable to handle kernel data access on write at 
>>>>> 0x6b6b6b6b6b6b725b
>>>>
>>>> That looks like some poisoned memory.
>>>>
>>>> I have run 128 runs of generic/095 locally on my Arm board during 
>>>> the fix,
>>>> unable to reproduce the crash anymore.
>>>>
>>>> And this call site is even harder to get race, as in endio context, 
>>>> the page
>>>> still has PageWriteback until the last bio finished in the page.
>>>>
>>>> This means btrfs_releasepage() will not even try to release the 
>>>> page, while
>>>> btrfs_invalidatepage() will wait the page to finish its writeback 
>>>> before
>>>> doing anything.
>>>>
>>>> So this is very strange to me.
>>>>
>>>> Any reproducibility on your side? Or something specific to Power is related
>>>> to this case? (IIRC some page flag operation is not atomic, maybe that is
>>>> related?)
>>>
>>> I doubt if this is Power related. And yes, I can reproduce the issue 
>>> fairly
>>> easily. For now I will exclude the test from my run to get a overall 
>>> run with
>>
>> Here, are some other failures that I noticed during testing on Power.
>> Thanks for looking into this.
> 
> Thank you very much for the extra test!
> 
>>
>> 1. tests/btrfs/052
>> btrfs/052       [failed, exit status 1]- output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
>>      --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad      2021-04-16 
>> 17:18:17.762928432 +0000
>>      @@ -91,553 +91,5 @@
>>       23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
>>       *
>>       30
>>      -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>      -*
>>      -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>>      -*
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'  
>> to see the entire diff)
>>
>> ^^^ this could also be due to below error found in 052.full
>>     ERROR: defrag range ioctl not supported in this kernel version, 
>> 2.6.33 and newer is required
>>     total 1 failures
>>     failed: '/usr/local/bin/btrfs filesystem defragment 
>> /mnt1/scratch/foo'
>>
>> 2. tests/btrfs/076 => looks a genuine failure.
>> btrfs/076       - output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
>>      --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad      2021-04-16 
>> 17:19:33.344981383 +0000
>>      @@ -1,3 +1,3 @@
>>       QA output created by 076
>>      -80
>>      -80
>>      +1
>>      +1
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'  
>> to see the entire diff)
> 
> This is really a compression related one. Since I hardcoded to disable
> compression, the ratio is always be 1.
> 
>>
>> 3. tests/btrfs/106  => looks a genuine failure.
>> btrfs/106       - output mismatch (see 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
>>      --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
>>      +++ 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad      2021-04-16 
>> 17:49:27.296128823 +0000
>>      @@ -5,19 +5,19 @@
>>       File contents before unmount:
>>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>       *
>>      -40
>>      +1000
>>       File contents after remount:
>>       0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>      ...
>>      (Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out 
>> /home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'  
>> to see the entire diff)
> 
> That's a similar problem, compression needed
> while compression is hard coded to be disable, thus clone reports
> different value.
> 
>>
>>> these patches. Later will try and debug what is going on.
>>>
>>> But if you need any debug logs - do let me know, as it is fairly easily
>>> reproducible.
>>
>> For tests/generic/095 can you pls retry reproducing the issue (with your latest
>> patch) on your setup with below configs enabled?
>> 1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,
>>     CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM, 
>> CONFIG_DEBUG_STACKOVERFLOW,
>>     CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING
> 
> Thanks, I'll retry using the extra debugging options.
> 
> But I have a more solid explanation on why the bug happens now.
> 
> You're right, prepare_pages() should have the page locked by calling
> find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
> just release the page.
> 
> But there is a small window in prepare_uptodate_page(), where we may
> call btrfs_readpage(), which will unlock the page.
> 
> So there is a window where we have page unlocked, before we re-lock it
> in prepare_uptodate_page().
> 
> By that, we got a page with its Private bit cleared.
> 
> I'm trying a better fix like the following diff.
> But I'm not yet 100% confident if the PagePrivate() check is enough,
> thus I'll do more test before sending the proper fix.
> 
> Thanks,
> Qu
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 45ec3f5ef839..49f78d643392 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode 
> *inode,
>                         unlock_page(page);
>                         return -EIO;
>                 }
> -               if (page->mapping != inode->i_mapping) {
> +
> +               /*
> +                * Since btrfs_readpage() will get the page unlocked, we
> have
> +                * a window where fadvice() can try to release the page.
> +                * Here we check both inode mapping and PagePrivate() to
> +                * make sure the page is not released.
> +                *
> +                * The priavte flag check is essential for subpage as we
> need
> +                * to store extra bitmap using page->private.
> +                */
> +               if (page->mapping != inode->i_mapping ||
> PagePrivate(page)) {
   ^ Obviously it should be !PagePrivate(page).

Thanks,
Qu

>                         unlock_page(page);
>                         return -EAGAIN;
>                 }
> 
> 
>>
>>
>> -ritesh
>>
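
The corrected condition (`!PagePrivate(page)`) can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: `struct mapping_model`, `struct cached_page_model` and `recheck_page_after_read()` are hypothetical names invented for illustration, not kernel APIs, and the real check runs with the page re-locked inside prepare_uptodate_page().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space. */
struct mapping_model {
	int unused;
};

/* Hypothetical stand-in for struct page, reduced to the two checked fields. */
struct cached_page_model {
	struct mapping_model *mapping;	/* models page->mapping */
	bool has_private;		/* models PagePrivate(page) */
};

/*
 * Models the re-check done after the page was unlocked and locked again:
 * retry if the page left the mapping OR lost its private data.
 */
static int recheck_page_after_read(const struct cached_page_model *page,
				   const struct mapping_model *expected)
{
	assert(page != NULL);

	if (page->mapping != expected || !page->has_private)
		return -EAGAIN;	/* caller is expected to look the page up again */
	return 0;
}
```

The point of the negation is that a page which still belongs to the inode but has lost its private data is also treated as released.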

Qu Wenruo April 19, 2021, 1:24 p.m. UTC | #34

[...]
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 45ec3f5ef839..49f78d643392 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode 
>> *inode,
>>                         unlock_page(page);
>>                         return -EIO;
>>                 }
>> -               if (page->mapping != inode->i_mapping) {
>> +
>> +               /*
>> +                * Since btrfs_readpage() will get the page unlocked, we
>> have
>> +                * a window where fadvice() can try to release the page.
>> +                * Here we check both inode mapping and PagePrivate() to
>> +                * make sure the page is not released.
>> +                *
>> +                * The priavte flag check is essential for subpage as we
>> need
>> +                * to store extra bitmap using page->private.
>> +                */
>> +               if (page->mapping != inode->i_mapping ||
>> PagePrivate(page)) {
>   ^ Obviously it should be !PagePrivate(page).

Hi Ritesh,

Would you mind giving generic/095 another try?

This time the branch is updated with the following commit at top:

commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage, github/subpage)
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon Apr 19 13:41:31 2021 +0800

     btrfs: fix a crash caused by race between prepare_pages() and
     btrfs_releasepage()

The fix uses the PagePrivate() check to avoid the problem, and passes 
several generic/auto loops without any sign of crash.

But since I have always had difficulty reproducing the bug with the
previous improper fix, your verification would be very helpful.

Thanks,
Qu
> 
> Thanks,
> Qu
> 
>>                         unlock_page(page);
>>                         return -EAGAIN;
>>                 }
>>
>>
>>>
>>>
>>> -ritesh
>>>
>

Ritesh Harjani April 21, 2021, 7:03 a.m. UTC | #35

On 21/04/19 09:24PM, Qu Wenruo wrote:
> [...]
> > >
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 45ec3f5ef839..49f78d643392 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > *inode,
> > >                         unlock_page(page);
> > >                         return -EIO;
> > >                 }
> > > -               if (page->mapping != inode->i_mapping) {
> > > +
> > > +               /*
> > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > have
> > > +                * a window where fadvice() can try to release the page.
> > > +                * Here we check both inode mapping and PagePrivate() to
> > > +                * make sure the page is not released.
> > > +                *
> > > +                * The priavte flag check is essential for subpage as we
> > > need
> > > +                * to store extra bitmap using page->private.
> > > +                */
> > > +               if (page->mapping != inode->i_mapping ||
> > > PagePrivate(page)) {
> >   ^ Obviously it should be !PagePrivate(page).
>
> Hi Ritesh,
>
> Mind to have another try on generic/095?
>
> This time the branch is updated with the following commit at top:
>
> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> github/subpage)
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon Apr 19 13:41:31 2021 +0800
>
>     btrfs: fix a crash caused by race between prepare_pages() and
>     btrfs_releasepage()
>
> The fix uses the PagePrivate() check to avoid the problem, and passes
> several generic/auto loops without any sign of crash.
>
> But considering I always have difficult in reproducing the bug with previous
> improper fix, your verification would be very helpful.
>

Hi Qu,

Thanks for the patch. I did try above patch but even with this I could still
reproduce the issue.

1. I think the original problem could be related to the logs below.
	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
	<...>
	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!

Meaning, there might be a race here between DIO and buffered IO.
From the DIO path we call invalidate_inode_pages2_range(). Somehow this may be
causing a call to btrfs_releasepage().

Now, from the code, invalidate_inode_pages2_range() can be called from both
__iomap_dio_rw() and iomap_dio_complete(). So it is not clear which of these
call sites might be triggering this bug.

I will try and debug more, but I thought I would update you with the above findings.

-ritesh

Qu Wenruo April 21, 2021, 7:15 a.m. UTC | #36

On 2021/4/21 3:03 PM, riteshh wrote:
> On 21/04/19 09:24PM, Qu Wenruo wrote:
>> [...]
>>>>
>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>> index 45ec3f5ef839..49f78d643392 100644
>>>> --- a/fs/btrfs/file.c
>>>> +++ b/fs/btrfs/file.c
>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>> *inode,
>>>>                          unlock_page(page);
>>>>                          return -EIO;
>>>>                  }
>>>> -               if (page->mapping != inode->i_mapping) {
>>>> +
>>>> +               /*
>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>> have
>>>> +                * a window where fadvice() can try to release the page.
>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>> +                * make sure the page is not released.
>>>> +                *
>>>> +                * The priavte flag check is essential for subpage as we
>>>> need
>>>> +                * to store extra bitmap using page->private.
>>>> +                */
>>>> +               if (page->mapping != inode->i_mapping ||
>>>> PagePrivate(page)) {
>>>    ^ Obviously it should be !PagePrivate(page).
>>
>> Hi Ritesh,
>>
>> Mind to have another try on generic/095?
>>
>> This time the branch is updated with the following commit at top:
>>
>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>> github/subpage)
>> Author: Qu Wenruo <wqu@suse.com>
>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>
>>      btrfs: fix a crash caused by race between prepare_pages() and
>>      btrfs_releasepage()
>>
>> The fix uses the PagePrivate() check to avoid the problem, and passes
>> several generic/auto loops without any sign of crash.
>>
>> But considering I always have difficult in reproducing the bug with previous
>> improper fix, your verification would be very helpful.
>>
>
> Hi Qu,
>
> Thanks for the patch. I did try above patch but even with this I could still
> reproduce the issue.
>
> 1. I think the original problem could be due to below logs.
> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> 	<...>
> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>
> Meaning, there might be a race here between DIO and buffered IO.
> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> causing call of btrfs_releasepage().
>
> Now from code, invalidate_inode_pages2_range() can be called from both
> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> where this might be triggering this bug.
>
> I will try and debug more. But I thought I will update you with above findings.

Your finding and testing are really helpful.

BTW, Goldwyn helped me test the same patchset on Power too, but
unfortunately he couldn't reproduce the bug on generic/095 either.

So I'm afraid the bug is way more complex than I thought.

BTW, have you tried enabling KASAN to see if it can find the
problem?

Thanks,
Qu
>
> -ritesh
>

Ritesh Harjani April 21, 2021, 7:30 a.m. UTC | #37

On 21/04/21 12:33PM, riteshh wrote:
> On 21/04/19 09:24PM, Qu Wenruo wrote:
> > [...]
> > > >
> > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > index 45ec3f5ef839..49f78d643392 100644
> > > > --- a/fs/btrfs/file.c
> > > > +++ b/fs/btrfs/file.c
> > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > > *inode,
> > > >                         unlock_page(page);
> > > >                         return -EIO;
> > > >                 }
> > > > -               if (page->mapping != inode->i_mapping) {
> > > > +
> > > > +               /*
> > > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > > have
> > > > +                * a window where fadvice() can try to release the page.
> > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > +                * make sure the page is not released.
> > > > +                *
> > > > +                * The priavte flag check is essential for subpage as we
> > > > need
> > > > +                * to store extra bitmap using page->private.
> > > > +                */
> > > > +               if (page->mapping != inode->i_mapping ||
> > > > PagePrivate(page)) {
> > >   ^ Obviously it should be !PagePrivate(page).
> >
> > Hi Ritesh,
> >
> > Mind to have another try on generic/095?
> >
> > This time the branch is updated with the following commit at top:
> >
> > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > github/subpage)
> > Author: Qu Wenruo <wqu@suse.com>
> > Date:   Mon Apr 19 13:41:31 2021 +0800
> >
> >     btrfs: fix a crash caused by race between prepare_pages() and
> >     btrfs_releasepage()
> >
> > The fix uses the PagePrivate() check to avoid the problem, and passes
> > several generic/auto loops without any sign of crash.
> >
> > But considering I always have difficult in reproducing the bug with previous
> > improper fix, your verification would be very helpful.
> >
>
> Hi Qu,
>
> Thanks for the patch. I did try above patch but even with this I could still
> reproduce the issue.
>
> 1. I think the original problem could be due to below logs.
> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> 	<...>
> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>
> Meaning, there might be a race here between DIO and buffered IO.
> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> causing call of btrfs_releasepage().
>
> Now from code, invalidate_inode_pages2_range() can be called from both
> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> where this might be triggering this bug.

I think I found one of the problems.
1. We use the page->private pointer to hold the btrfs_subpage struct, which also
   happens to contain a spinlock.

   Now in btrfs_subpage_clear_writeback()
   -> we take this spinlock: spin_lock_irqsave(&subpage->lock, flags);
   -> we call end_page_writeback(page);
          -> this may end up waking up invalidate_inode_pages2_range(),
             which is waiting for writeback to complete.
                  -> this may then call btrfs_releasepage() on the
                     same page and free the subpage structure.

   -> then we call spin_unlock => by now the btrfs_subpage structure has been
      freed, but we still access it, hence the spinlock bad magic BUG.

<below call traces were observed without any fixes>
<i.e. tree contained patches till "btrfs: reject raid5/6 fs for subpage">
[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
[   81.118576] BTRFS: device fsid 0450e360-e0ea-4cff-9f84-3c6064437ef6 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (4669)
[   81.208410] BTRFS info (device loop3): disk space caching is enabled
[   81.209219] BTRFS info (device loop3): has skinny extents
[   81.209849] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[   81.219579] BTRFS info (device loop3): checking UUID tree
[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   83.639921] File: /mnt1/scratch/file1 PID: 221 Comm: kworker/30:1
[   85.130349] fio (4720) used greatest stack depth: 7808 bytes left
[   87.022500] BUG: spinlock bad magic on CPU#26, swapper/26/0
[   87.023457] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
[   87.024776] Faulting instruction address: 0xc000000000283654
cpu 0x1a: Vector: 380 (Data SLB Access) at [c000000007af7160]
    pc: c000000000283654: spin_dump+0x70/0xbc
    lr: c000000000283638: spin_dump+0x54/0xbc
    sp: c000000007af7400
   msr: 8000000000009033
   dar: 6b6b6b6b6b6b725b
  current = 0xc000000007ab9800
  paca    = 0xc00000003ffc9a00   irqmask: 0x03   irq_happened: 0x01
    pid   = 0, comm = swapper/26
Linux version 5.12.0-rc7-02317-gee3f9a64895 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #78 SMP Wed Apr 21 01:10:41 CDT 2021
enter ? for help
[c000000007af7470] c000000000283078 do_raw_spin_unlock+0x88/0x230
[c000000007af74a0] c0000000012b1e34 _raw_spin_unlock_irqrestore+0x44/0x90
[c000000007af74d0] c000000000a918fc btrfs_subpage_clear_writeback+0xac/0xe0
[c000000007af7530] c0000000009e0478 end_bio_extent_writepage+0x158/0x270
[c000000007af75f0] c000000000b6fd34 bio_endio+0x254/0x270
[c000000007af7630] c0000000009fc110 btrfs_end_bio+0x1a0/0x200
[c000000007af7670] c000000000b6fd34 bio_endio+0x254/0x270
[c000000007af76b0] c000000000b7821c blk_update_request+0x46c/0x670
[c000000007af7760] c000000000b8b3b4 blk_mq_end_request+0x34/0x1d0
[c000000007af77a0] c000000000d82d3c lo_complete_rq+0x11c/0x140
[c000000007af77d0] c000000000b880c4 blk_complete_reqs+0x84/0xb0
[c000000007af7800] c0000000012b2cc4 __do_softirq+0x334/0x680
[c000000007af7910] c0000000001dd878 irq_exit+0x148/0x1d0
[c000000007af7940] c000000000016f4c do_IRQ+0x20c/0x240
[c000000007af79d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

-ritesh

Qu Wenruo April 21, 2021, 8:26 a.m. UTC | #38

On 2021/4/21 3:30 PM, riteshh wrote:
> On 21/04/21 12:33PM, riteshh wrote:
>> On 21/04/19 09:24PM, Qu Wenruo wrote:
>>> [...]
>>>>>
>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>> index 45ec3f5ef839..49f78d643392 100644
>>>>> --- a/fs/btrfs/file.c
>>>>> +++ b/fs/btrfs/file.c
>>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>>> *inode,
>>>>>                          unlock_page(page);
>>>>>                          return -EIO;
>>>>>                  }
>>>>> -               if (page->mapping != inode->i_mapping) {
>>>>> +
>>>>> +               /*
>>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>>> have
>>>>> +                * a window where fadvice() can try to release the page.
>>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>>> +                * make sure the page is not released.
>>>>> +                *
>>>>> +                * The priavte flag check is essential for subpage as we
>>>>> need
>>>>> +                * to store extra bitmap using page->private.
>>>>> +                */
>>>>> +               if (page->mapping != inode->i_mapping ||
>>>>> PagePrivate(page)) {
>>>>    ^ Obviously it should be !PagePrivate(page).
>>>
>>> Hi Ritesh,
>>>
>>> Mind to have another try on generic/095?
>>>
>>> This time the branch is updated with the following commit at top:
>>>
>>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>>> github/subpage)
>>> Author: Qu Wenruo <wqu@suse.com>
>>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>>
>>>      btrfs: fix a crash caused by race between prepare_pages() and
>>>      btrfs_releasepage()
>>>
>>> The fix uses the PagePrivate() check to avoid the problem, and passes
>>> several generic/auto loops without any sign of crash.
>>>
>>> But considering I always have difficult in reproducing the bug with previous
>>> improper fix, your verification would be very helpful.
>>>
>>
>> Hi Qu,
>>
>> Thanks for the patch. I did try above patch but even with this I could still
>> reproduce the issue.
>>
>> 1. I think the original problem could be due to below logs.
>> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
>> 	<...>
>> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>>
>> Meaning, there might be a race here between DIO and buffered IO.
>> So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
>> causing call of btrfs_releasepage().
>>
>> Now from code, invalidate_inode_pages2_range() can be called from both
>> __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
>> where this might be triggering this bug.
> 
> I think I got one of the problem.
> 1. we use page->private pointer as btrfs_subpage struct which also happens to
>     hold spinlock within it.
> 
>     Now in btrfs_subpage_clear_writeback()
>     -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
>     -> we call end_page_writeback(page);
>     		  -> this may end up waking up invalidate_inode_pages2_range()
> 		  which is waiting for writeback to complete.
> 			  -> this then may also call btrfs_releasepage() on the
> 			  same page and also free the subpage structure.

This indeed looks like a problem.

It means we need to hit a small race window like the one below:
(btrfs_invalidatepage() doesn't seem able to race, considering how much
  work that function has to do)

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()		| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |- spin_unlock_irqrestore().

It looks like my arm boards are not fast enough to trigger the race.

It can be fixed by doing the same thing as for the dirty bit: check the
bitmap first, then call end_page_writeback() after the spinlock is released.

Would you please try the following fix? (It is based on the latest branch,
which already includes the previous fixes.)

I'm also running the tests on all my arm boards to make sure it doesn't
cause extra problems. So far so good, but my boards are far from fast, so it
is not yet 100% tested.

Thanks,
Qu

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 696485ab68a2..c5abf9745c10 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
  {
         struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
         u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+       bool finished = false;
         unsigned long flags;

         spin_lock_irqsave(&subpage->lock, flags);
         subpage->writeback_bitmap &= ~tmp;
         if (subpage->writeback_bitmap == 0)
-               end_page_writeback(page);
+               finished = true;
         spin_unlock_irqrestore(&subpage->lock, flags);
+       if (finished)
+               end_page_writeback(page);
  }

  void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
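
The ordering used in the diff above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct subpage_model`, `struct page_model`, the toy `atomic_flag` spinlock and all `*_model` helpers are hypothetical stand-ins, not the kernel structures or locking primitives. It shows the shape of the fix: decide under the lock, act after dropping it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct btrfs_subpage. */
struct subpage_model {
	atomic_flag lock;		/* toy spinlock, not the kernel spinlock_t */
	uint16_t writeback_bitmap;	/* one bit per sector under writeback */
};

/* Hypothetical stand-in for struct page; it only counts completions. */
struct page_model {
	struct subpage_model *private;
	int writeback_ended;
};

static void subpage_model_init(struct subpage_model *sp, uint16_t bitmap)
{
	atomic_flag_clear(&sp->lock);
	sp->writeback_bitmap = bitmap;
}

static void model_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void model_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Models end_page_writeback(): after this, a waiter may free page->private. */
static void end_page_writeback_model(struct page_model *page)
{
	page->writeback_ended++;
}

static void subpage_clear_writeback_model(struct page_model *page, uint16_t bits)
{
	struct subpage_model *subpage = page->private;
	bool finished = false;

	assert(subpage != NULL);

	model_lock(&subpage->lock);
	subpage->writeback_bitmap &= (uint16_t)~bits;
	if (subpage->writeback_bitmap == 0)
		finished = true;	/* only record the decision under the lock */
	model_unlock(&subpage->lock);

	/* From here on, subpage must not be touched again. */
	if (finished)
		end_page_writeback_model(page);
}
```

Because the completion call happens only after the unlock, nothing dereferences `subpage` once a waiter could have been woken, which is the property the original code lacked.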

> 
>     -> then we call spin_unlock => here the btrfs_subpage structure got freed
>     but we still accessed and hence causing spinlock bug corruption
> 
> <below call traces were observed without any fixes>
> <i.e. tree contained patches till "btrfs: reject raid5/6 fs for subpage">
> [   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> [   81.118576] BTRFS: device fsid 0450e360-e0ea-4cff-9f84-3c6064437ef6 devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (4669)
> [   81.208410] BTRFS info (device loop3): disk space caching is enabled
> [   81.209219] BTRFS info (device loop3): has skinny extents
> [   81.209849] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
> [   81.219579] BTRFS info (device loop3): checking UUID tree
> [   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   83.639921] File: /mnt1/scratch/file1 PID: 221 Comm: kworker/30:1
> [   85.130349] fio (4720) used greatest stack depth: 7808 bytes left
> [   87.022500] BUG: spinlock bad magic on CPU#26, swapper/26/0
> [   87.023457] BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
> [   87.024776] Faulting instruction address: 0xc000000000283654
> cpu 0x1a: Vector: 380 (Data SLB Access) at [c000000007af7160]
>      pc: c000000000283654: spin_dump+0x70/0xbc
>      lr: c000000000283638: spin_dump+0x54/0xbc
>      sp: c000000007af7400
>     msr: 8000000000009033
>     dar: 6b6b6b6b6b6b725b
>    current = 0xc000000007ab9800
>    paca    = 0xc00000003ffc9a00   irqmask: 0x03   irq_happened: 0x01
>      pid   = 0, comm = swapper/26
> Linux version 5.12.0-rc7-02317-gee3f9a64895 (riteshh@ltctulc6a-p1) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #78 SMP Wed Apr 21 01:10:41 CDT 2021
> enter ? for help
> [c000000007af7470] c000000000283078 do_raw_spin_unlock+0x88/0x230
> [c000000007af74a0] c0000000012b1e34 _raw_spin_unlock_irqrestore+0x44/0x90
> [c000000007af74d0] c000000000a918fc btrfs_subpage_clear_writeback+0xac/0xe0
> [c000000007af7530] c0000000009e0478 end_bio_extent_writepage+0x158/0x270
> [c000000007af75f0] c000000000b6fd34 bio_endio+0x254/0x270
> [c000000007af7630] c0000000009fc110 btrfs_end_bio+0x1a0/0x200
> [c000000007af7670] c000000000b6fd34 bio_endio+0x254/0x270
> [c000000007af76b0] c000000000b7821c blk_update_request+0x46c/0x670
> [c000000007af7760] c000000000b8b3b4 blk_mq_end_request+0x34/0x1d0
> [c000000007af77a0] c000000000d82d3c lo_complete_rq+0x11c/0x140
> [c000000007af77d0] c000000000b880c4 blk_complete_reqs+0x84/0xb0
> [c000000007af7800] c0000000012b2cc4 __do_softirq+0x334/0x680
> [c000000007af7910] c0000000001dd878 irq_exit+0x148/0x1d0
> [c000000007af7940] c000000000016f4c do_IRQ+0x20c/0x240
> [c000000007af79d0] c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
> 
> -ritesh
>

Ritesh Harjani April 21, 2021, 11:13 a.m. UTC | #39

On 21/04/21 04:26PM, Qu Wenruo wrote:
>
>
> On 2021/4/21 3:30 PM, riteshh wrote:
> > On 21/04/21 12:33PM, riteshh wrote:
> > > On 21/04/19 09:24PM, Qu Wenruo wrote:
> > > > [...]
> > > > > >
> > > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > > > index 45ec3f5ef839..49f78d643392 100644
> > > > > > --- a/fs/btrfs/file.c
> > > > > > +++ b/fs/btrfs/file.c
> > > > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
> > > > > > *inode,
> > > > > >                          unlock_page(page);
> > > > > >                          return -EIO;
> > > > > >                  }
> > > > > > -               if (page->mapping != inode->i_mapping) {
> > > > > > +
> > > > > > +               /*
> > > > > > +                * Since btrfs_readpage() will get the page unlocked, we
> > > > > > have
> > > > > > +                * a window where fadvice() can try to release the page.
> > > > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > > > +                * make sure the page is not released.
> > > > > > +                *
> > > > > > +                * The priavte flag check is essential for subpage as we
> > > > > > need
> > > > > > +                * to store extra bitmap using page->private.
> > > > > > +                */
> > > > > > +               if (page->mapping != inode->i_mapping ||
> > > > > > PagePrivate(page)) {
> > > > >    ^ Obviously it should be !PagePrivate(page).
> > > >
> > > > Hi Ritesh,
> > > >
> > > > Mind to have another try on generic/095?
> > > >
> > > > This time the branch is updated with the following commit at top:
> > > >
> > > > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > > > github/subpage)
> > > > Author: Qu Wenruo <wqu@suse.com>
> > > > Date:   Mon Apr 19 13:41:31 2021 +0800
> > > >
> > > >      btrfs: fix a crash caused by race between prepare_pages() and
> > > >      btrfs_releasepage()
> > > >
> > > > The fix uses the PagePrivate() check to avoid the problem, and passes
> > > > several generic/auto loops without any sign of crash.
> > > >
> > > > But considering I always have difficult in reproducing the bug with previous
> > > > improper fix, your verification would be very helpful.
> > > >
> > >
> > > Hi Qu,
> > >
> > > Thanks for the patch. I did try above patch but even with this I could still
> > > reproduce the issue.
> > >
> > > 1. I think the original problem could be due to below logs.
> > > 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> > > 	<...>
> > > 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > >
> > > Meaning, there might be a race here between DIO and buffered IO.
> > > So from DIO path we call invalidate_inode_pages2_range(). Somehow this maybe
> > > causing call of btrfs_releasepage().
> > >
> > > Now from code, invalidate_inode_pages2_range() can be called from both
> > > __iomap_dio_rw() and from iomap_dio_complete(). So it is not clear as to from
> > > where this might be triggering this bug.
> >
> > I think I got one of the problem.
> > 1. we use page->private pointer as btrfs_subpage struct which also happens to
> >     hold spinlock within it.
> >
> >     Now in btrfs_subpage_clear_writeback()
> >     -> we take this spinlock  spin_lock_irqsave(&subpage->lock, flags);
> >     -> we call end_page_writeback(page);
> >     		  -> this may end up waking up invalidate_inode_pages2_range()
> > 		  which is waiting for writeback to complete.
> > 			  -> this then may also call btrfs_releasepage() on the
> > 			  same page and also free the subpage structure.
>
> This indeeds looks like a problem.
>
> This really means we need to have such a small race window below:
> (btrfs_invalidatepage() doesn't seem to be possible to race considering
>  how much work needed to be done in that function)
>
> 	Thread 1		|		Thread 2
> --------------------------------+------------------------------------
>  end_bio_extent_writepage()	| btrfs_releasepage()
>  |- spin_lock_irqsave()		| |
>  |- end_page_writeback()	| |
>  |				| |- if (PageWriteback() ||...)
>  |				| |- clear_page_extent_mapped()
>  |- spin_unlock_irqrestore().
>
> It looks like my arm boards are not fast enough to trigger the race.
>
> Although it can be fixed by doing the same thing as dirty bit, by checking
> the bitmap first and then call end_page_writeback() with spinlock unlocked.
>
> Would you please try the following fix? (based on the latest branch, which
> already has the previous fixes included).
>
> I'm also running the tests on all my arm boards to make sure it doesn't
> cause extra problem, so far so good, but my board is far from fast, thus not
> yet 100% tested.
>
> Thanks,
> Qu
>
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 696485ab68a2..c5abf9745c10 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct
> btrfs_fs_info *fs_info,
>  {
>         struct btrfs_subpage *subpage = (struct btrfs_subpage
> *)page->private;
>         u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +       bool finished = false;
>         unsigned long flags;
>
>         spin_lock_irqsave(&subpage->lock, flags);
>         subpage->writeback_bitmap &= ~tmp;
>         if (subpage->writeback_bitmap == 0)
> -               end_page_writeback(page);
> +               finished = true;
>         spin_unlock_irqrestore(&subpage->lock, flags);
> +       if (finished)
> +               end_page_writeback(page);
>  }
>
>  void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,

Thanks for this patch. I have re-tested generic/095 with 100 iterations, plus -g
quick, with both of your patches applied. I don't see this issue anymore.
So with the two patches (including the one above) the race with
btrfs_releasepage() is now fixed.


For both of these patches, please feel free to add:

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>

-ritesh

Qu Wenruo April 21, 2021, 11:42 a.m. UTC | #40

On 2021/4/21 7:13 PM, riteshh wrote:
> On 21/04/21 04:26PM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/21 3:30 PM, riteshh wrote:
>>> On 21/04/21 12:33PM, riteshh wrote:
>>>> On 21/04/19 09:24PM, Qu Wenruo wrote:
>>>>> [...]
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>>>>>>> index 45ec3f5ef839..49f78d643392 100644
>>>>>>> --- a/fs/btrfs/file.c
>>>>>>> +++ b/fs/btrfs/file.c
>>>>>>> @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode
>>>>>>> *inode,
>>>>>>>                           unlock_page(page);
>>>>>>>                           return -EIO;
>>>>>>>                   }
>>>>>>> -               if (page->mapping != inode->i_mapping) {
>>>>>>> +
>>>>>>> +               /*
>>>>>>> +                * Since btrfs_readpage() will get the page unlocked, we
>>>>>>> have
>>>>>>> +                * a window where fadvice() can try to release the page.
>>>>>>> +                * Here we check both inode mapping and PagePrivate() to
>>>>>>> +                * make sure the page is not released.
>>>>>>> +                *
>>>>>>> +                * The priavte flag check is essential for subpage as we
>>>>>>> need
>>>>>>> +                * to store extra bitmap using page->private.
>>>>>>> +                */
>>>>>>> +               if (page->mapping != inode->i_mapping ||
>>>>>>> PagePrivate(page)) {
>>>>>>     ^ Obviously it should be !PagePrivate(page).
>>>>>
>>>>> Hi Ritesh,
>>>>>
>>>>> Mind to have another try on generic/095?
>>>>>
>>>>> This time the branch is updated with the following commit at top:
>>>>>
>>>>> commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
>>>>> github/subpage)
>>>>> Author: Qu Wenruo <wqu@suse.com>
>>>>> Date:   Mon Apr 19 13:41:31 2021 +0800
>>>>>
>>>>>       btrfs: fix a crash caused by race between prepare_pages() and
>>>>>       btrfs_releasepage()
>>>>>
>>>>> The fix uses the PagePrivate() check to avoid the problem, and passes
>>>>> several generic/auto loops without any sign of crash.
>>>>>
>>>>> But considering I have always had difficulty reproducing the bug with the
>>>>> previous improper fix, your verification would be very helpful.
>>>>>
>>>>
>>>> Hi Qu,
>>>>
>>>> Thanks for the patch. I did try above patch but even with this I could still
>>>> reproduce the issue.
>>>>
>>>> 1. I think the original problem could be related to the logs below.
>>>> 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
>>>> 	<...>
>>>> 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
>>>>
>>>> This means there might be a race here between DIO and buffered IO.
>>>> From the DIO path we call invalidate_inode_pages2_range(), which may somehow
>>>> be causing the call to btrfs_releasepage().
>>>>
>>>> From the code, invalidate_inode_pages2_range() can be called from both
>>>> __iomap_dio_rw() and iomap_dio_complete(), so it is not clear which call
>>>> site might be triggering this bug.
>>>
>>> I think I found one of the problems.
>>> 1. We use the page->private pointer to hold the btrfs_subpage struct, which
>>>      also happens to contain a spinlock.
>>>
>>>      Now in btrfs_subpage_clear_writeback()
>>>      -> we take this spinlock: spin_lock_irqsave(&subpage->lock, flags);
>>>      -> we call end_page_writeback(page);
>>>         -> this may end up waking invalidate_inode_pages2_range(),
>>>            which is waiting for writeback to complete.
>>>            -> this may then also call btrfs_releasepage() on the
>>>               same page and free the subpage structure.
>>
>> This indeed looks like a problem.
>>
>> It means we need to hit the small race window shown below
>> (btrfs_invalidatepage() seems unlikely to race, considering how much
>>   work is done in that function):
>>
>> 	Thread 1		|		Thread 2
>> --------------------------------+------------------------------------
>>   end_bio_extent_writepage()	| btrfs_releasepage()
>>   |- spin_lock_irqsave()		| |
>>   |- end_page_writeback()	| |
>>   |				| |- if (PageWriteback() ||...)
>>   |				| |- clear_page_extent_mapped()
>>   |- spin_unlock_irqrestore().
>>
>> It looks like my arm boards are not fast enough to trigger the race.
>>
>> It can be fixed the same way as for the dirty bit, though: check the bitmap
>> first, then call end_page_writeback() with the spinlock released.
>>
>> Would you please try the following fix? (It is based on the latest branch,
>> which already includes the previous fixes.)
>>
>> I'm also running the tests on all my arm boards to make sure it doesn't
>> cause extra problems. So far so good, but my boards are far from fast, so
>> it's not yet 100% tested.
>>
>> Thanks,
>> Qu
>>
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> index 696485ab68a2..c5abf9745c10 100644
>> --- a/fs/btrfs/subpage.c
>> +++ b/fs/btrfs/subpage.c
>> @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
>>   {
>>          struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>>          u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> +       bool finished = false;
>>          unsigned long flags;
>>
>>          spin_lock_irqsave(&subpage->lock, flags);
>>          subpage->writeback_bitmap &= ~tmp;
>>          if (subpage->writeback_bitmap == 0)
>> -               end_page_writeback(page);
>> +               finished = true;
>>          spin_unlock_irqrestore(&subpage->lock, flags);
>> +       if (finished)
>> +               end_page_writeback(page);
>>   }
>>
>>   void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
>
> Thanks for this patch. I have re-tested generic/095 with 100 iterations and -g
> quick (with both of your patches). I don't see this issue anymore.
> So with the two patches (including above one) the race with
> btrfs_releasepage() is now fixed.
>
>
> For both of these patches, please feel free to add:
>
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>

Thanks for the test.

I'm honestly a little envious of your fast Power system.
My ARM board hasn't even finished one generic/auto run...

Thanks,
Qu
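
The fix quoted above can be modelled in userspace. The idea is to decide under the lock whether the last writeback bit was just cleared, and to make the page-level call only after the lock is dropped, so that anything woken by that call cannot free the structure while its lock is still held. This is only a sketch under stated assumptions: struct subpage_model, model_clear_writeback() and the counter are hypothetical names, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct btrfs_subpage. */
struct subpage_model {
	pthread_mutex_t lock;       /* stands in for the kernel spinlock */
	uint16_t writeback_bitmap;  /* one bit per sector under writeback */
};

/* Counts page-level completions; stands in for end_page_writeback(). */
static int page_writeback_ended;

static void model_end_page_writeback(void)
{
	page_writeback_ended++;
}

static void model_init(struct subpage_model *sp, uint16_t bits)
{
	assert(sp != NULL);
	pthread_mutex_init(&sp->lock, NULL);
	sp->writeback_bitmap = bits;
}

/*
 * Clear @bits from the bitmap. Whether the page is finished is decided
 * under the lock, but the page-level call is made only after unlocking,
 * so nothing woken by that call can free @sp while we still hold its lock.
 * Returns true if this call completed writeback for the whole page.
 */
static bool model_clear_writeback(struct subpage_model *sp, uint16_t bits)
{
	bool finished;

	pthread_mutex_lock(&sp->lock);
	sp->writeback_bitmap &= (uint16_t)~bits;
	finished = (sp->writeback_bitmap == 0);
	pthread_mutex_unlock(&sp->lock);

	if (finished)
		model_end_page_writeback();
	return finished;
}
```

Clearing one of two set bits leaves the counter untouched. Clearing the last bit triggers exactly one page-level completion, and it happens outside the locked region.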

>
> -ritesh
>

Ritesh Harjani April 21, 2021, 12:15 p.m. UTC | #41

On 21/04/21 07:42PM, Qu Wenruo wrote:
>
>
> On 2021/4/21 7:13 PM, riteshh wrote:
> > On 21/04/21 04:26PM, Qu Wenruo wrote:
> > >
> > >
> > > On 2021/4/21 3:30 PM, riteshh wrote:
> > > > On 21/04/21 12:33PM, riteshh wrote:
> > > > > On 21/04/19 09:24PM, Qu Wenruo wrote:
> > > > > > [...]
> > > > > > > >
> > > > > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > > > > > > index 45ec3f5ef839..49f78d643392 100644
> > > > > > > > --- a/fs/btrfs/file.c
> > > > > > > > +++ b/fs/btrfs/file.c
> > > > > > > > @@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,
> > > > > > > >                           unlock_page(page);
> > > > > > > >                           return -EIO;
> > > > > > > >                   }
> > > > > > > > -               if (page->mapping != inode->i_mapping) {
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * Since btrfs_readpage() will get the page unlocked, we have
> > > > > > > > +                * a window where fadvise() can try to release the page.
> > > > > > > > +                * Here we check both inode mapping and PagePrivate() to
> > > > > > > > +                * make sure the page is not released.
> > > > > > > > +                *
> > > > > > > > +                * The private flag check is essential for subpage as we need
> > > > > > > > +                * to store extra bitmap using page->private.
> > > > > > > > +                */
> > > > > > > > +               if (page->mapping != inode->i_mapping || PagePrivate(page)) {
> > > > > > >     ^ Obviously it should be !PagePrivate(page).
> > > > > >
> > > > > > Hi Ritesh,
> > > > > >
> > > > > > Would you mind giving generic/095 another try?
> > > > > >
> > > > > > This time the branch is updated with the following commit at top:
> > > > > >
> > > > > > commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage,
> > > > > > github/subpage)
> > > > > > Author: Qu Wenruo <wqu@suse.com>
> > > > > > Date:   Mon Apr 19 13:41:31 2021 +0800
> > > > > >
> > > > > >       btrfs: fix a crash caused by race between prepare_pages() and
> > > > > >       btrfs_releasepage()
> > > > > >
> > > > > > The fix uses the PagePrivate() check to avoid the problem, and passes
> > > > > > several generic/auto loops without any sign of crash.
> > > > > >
> > > > > > But considering I have always had difficulty reproducing the bug with the
> > > > > > previous improper fix, your verification would be very helpful.
> > > > > >
> > > > >
> > > > > Hi Qu,
> > > > >
> > > > > Thanks for the patch. I did try above patch but even with this I could still
> > > > > reproduce the issue.
> > > > >
> > > > > 1. I think the original problem could be related to the logs below.
> > > > > 	[   79.079641] run fstests generic/095 at 2021-04-21 06:46:23
> > > > > 	<...>
> > > > > 	[   83.634710] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > > >
> > > > > This means there might be a race here between DIO and buffered IO.
> > > > > From the DIO path we call invalidate_inode_pages2_range(), which may somehow
> > > > > be causing the call to btrfs_releasepage().
> > > > >
> > > > > From the code, invalidate_inode_pages2_range() can be called from both
> > > > > __iomap_dio_rw() and iomap_dio_complete(), so it is not clear which call
> > > > > site might be triggering this bug.
> > > >
> > > > I think I found one of the problems.
> > > > 1. We use the page->private pointer to hold the btrfs_subpage struct, which
> > > >      also happens to contain a spinlock.
> > > >
> > > >      Now in btrfs_subpage_clear_writeback()
> > > >      -> we take this spinlock: spin_lock_irqsave(&subpage->lock, flags);
> > > >      -> we call end_page_writeback(page);
> > > >         -> this may end up waking invalidate_inode_pages2_range(),
> > > >            which is waiting for writeback to complete.
> > > >            -> this may then also call btrfs_releasepage() on the
> > > >               same page and free the subpage structure.
> > >
> > > This indeed looks like a problem.
> > >
> > > It means we need to hit the small race window shown below
> > > (btrfs_invalidatepage() seems unlikely to race, considering how much
> > >   work is done in that function):
> > >
> > > 	Thread 1		|		Thread 2
> > > --------------------------------+------------------------------------
> > >   end_bio_extent_writepage()	| btrfs_releasepage()
> > >   |- spin_lock_irqsave()		| |
> > >   |- end_page_writeback()	| |
> > >   |				| |- if (PageWriteback() ||...)
> > >   |				| |- clear_page_extent_mapped()
> > >   |- spin_unlock_irqrestore().
> > >
> > > It looks like my arm boards are not fast enough to trigger the race.
> > >
> > > It can be fixed the same way as for the dirty bit, though: check the bitmap
> > > first, then call end_page_writeback() with the spinlock released.
> > >
> > > Would you please try the following fix? (It is based on the latest branch,
> > > which already includes the previous fixes.)
> > >
> > > I'm also running the tests on all my arm boards to make sure it doesn't
> > > cause extra problems. So far so good, but my boards are far from fast, so
> > > it's not yet 100% tested.
> > >
> > > Thanks,
> > > Qu
> > >
> > > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > > index 696485ab68a2..c5abf9745c10 100644
> > > --- a/fs/btrfs/subpage.c
> > > +++ b/fs/btrfs/subpage.c
> > > @@ -420,13 +420,16 @@ void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
> > >   {
> > >          struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> > >          u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> > > +       bool finished = false;
> > >          unsigned long flags;
> > >
> > >          spin_lock_irqsave(&subpage->lock, flags);
> > >          subpage->writeback_bitmap &= ~tmp;
> > >          if (subpage->writeback_bitmap == 0)
> > > -               end_page_writeback(page);
> > > +               finished = true;
> > >          spin_unlock_irqrestore(&subpage->lock, flags);
> > > +       if (finished)
> > > +               end_page_writeback(page);
> > >   }
> > >
> > >   void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
> >
> > Thanks for this patch. I have re-tested generic/095 with 100 iterations and -g
> > quick (with both of your patches). I don't see this issue anymore.
> > So with the two patches (including above one) the race with
> > btrfs_releasepage() is now fixed.
> >
> >
> > For both of these patches, please feel free to add:
> >
> > Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
>
> Thanks for the test.
>
> I'm honestly a little envious of your fast Power system.
:)

> My ARM board hasn't even finished one generic/auto run...
The auto run could be slower. I have yet to do the full auto run testing with
your patch series on Power.
I just wanted to ensure that all the required configs (debug configs) are
enabled, hence the delay.

-ritesh
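
The first fix quoted in this thread (the prepare_uptodate_page() check) follows a "recheck after relocking" pattern. The page is unlocked while the read is in flight, so once the lock is re-taken the caller has to confirm the page was not released in between. With the correction noted in the quoted review comment, the page counts as released if it has lost either its mapping or its private data. A minimal sketch of that predicate follows. struct page_model and page_was_released() are hypothetical names, and the assumption that the caller retries with a fresh page is this sketch's, not something the quoted diff shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two page fields the check looks at. */
struct page_model {
	const void *mapping;  /* owning address space; NULL once detached */
	bool has_private;     /* models PagePrivate(page) */
};

/*
 * Meant to be called after the page lock has been re-taken following a read.
 * Returns true if the page was released in the unlocked window, in which
 * case the caller must not touch page->private (assumed here: it retries).
 */
static bool page_was_released(const struct page_model *page,
			      const void *expected_mapping)
{
	assert(page != NULL);
	return page->mapping != expected_mapping || !page->has_private;
}
```

Note the negation on the private check: a page that still has its private data attached is the healthy case. This is exactly the correction made in the quoted reply to the original diff.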
