[GIT,PULL] bcachefs

Message ID 20230626214656.hcp4puionmtoloat@moria.home.lan (mailing list archive)
State New, archived
Series [GIT,PULL] bcachefs

Pull-request

https://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream

Message

Kent Overstreet June 26, 2023, 9:46 p.m. UTC
Hi Linus,

Here it is, the bcachefs pull request. For brevity, the list of patches
below covers only the initial part of the series (the non-bcachefs prep
patches and the first bcachefs patch), but the diffstat is for the
entire series.

The SIX locks code has all the changes you suggested; text size went
down significantly. If you'd still like it to see more review from the
locking people, I'm not against the locks living in fs/bcachefs/ as an
interim measure; perhaps Dave could move them back to kernel/locking
when he starts using them, or when the locking people have had time to
look at them - I'm just hoping this doesn't block the merge.

Recently some people have expressed concerns about "not wanting a repeat
of ntfs3" - from what I understand, the issue there was just severe
bugginess, so perhaps showing the bcachefs automated test results will
help with that:

  https://evilpiepirate.org/~testdashboard/ci

The main bcachefs branch runs fstests and my own test suite in several
variations, including lockdep+kasan, preempt, and gcov (we're at 82%
line coverage); I'm not currently seeing any lockdep or kasan splats (or
panics/oopses, for that matter).

(Worth noting the bug causing the most test failures by a wide margin is
actually an io_uring bug that causes random umount failures in shutdown
tests. Would be great to get that looked at; it doesn't just affect
bcachefs).

Regarding feature status - most features are considered stable and ready
for use; snapshots and erasure coding are both nearly there. But a
filesystem of this scale is a massive project, and adequately conveying
the status of every feature would take at least a page or two.

We may want to mark it as EXPERIMENTAL for a few releases; I haven't
done that as yet. (I wouldn't consider single-device use without
snapshots to be experimental, but given that the number of users and bug
reports is about to shoot up, perhaps I should...).

Cheers,
Kent

---------

The following changes since commit 6995e2de6891c724bfeb2db33d7b87775f913ad1:

  Linux 6.4 (2023-06-25 16:29:58 -0700)

are available in the Git repository at:

  https://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream

for you to fetch changes up to 66012992c01af99f07ac696a8c9563ba291c1e7f:

  bcachefs: fsck needs BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE

----------------------------------------------------------------
Christopher James Halse Rogers (1):
      stacktrace: Export stack_trace_save_tsk

Daniel Hill (1):
      lib: add mean and variance module.

Kent Overstreet (26):
      Compiler Attributes: add __flatten
      locking/lockdep: lock_class_is_held()
      locking/lockdep: lockdep_set_no_check_recursion()
      locking: SIX locks (shared/intent/exclusive)
      MAINTAINERS: Add entry for six locks
      sched: Add task_struct->faults_disabled_mapping
      fs: factor out d_mark_tmpfile()
      block: Add some exports for bcachefs
      block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
      block: Bring back zero_fill_bio_iter
      block: Don't block on s_umount from __invalidate_super()
      bcache: move closures to lib/
      MAINTAINERS: Add entry for closures
      closures: closure_wait_event()
      closures: closure_nr_remaining()
      closures: Add a missing include
      iov_iter: copy_folio_from_iter_atomic()
      MAINTAINERS: Add entry for generic-radix-tree
      lib/generic-radix-tree.c: Don't overflow in peek()
      lib/generic-radix-tree.c: Add a missing include
      lib/generic-radix-tree.c: Add peek_prev()
      lib/string_helpers: string_get_size() now returns characters wrote
      lib: Export errname
      mean_and_variance: Assorted fixes/cleanups
      MAINTAINERS: Add entry for bcachefs
      bcachefs: Initial commit

 MAINTAINERS                                     |   39 +
 block/bdev.c                                    |    2 +-
 block/bio.c                                     |   18 +-
 block/blk-core.c                                |    1 +
 block/blk.h                                     |    1 -
 drivers/md/bcache/Kconfig                       |   10 +-
 drivers/md/bcache/Makefile                      |    4 +-
 drivers/md/bcache/bcache.h                      |    2 +-
 drivers/md/bcache/super.c                       |    1 -
 drivers/md/bcache/util.h                        |    3 +-
 fs/Kconfig                                      |    1 +
 fs/Makefile                                     |    1 +
 fs/bcachefs/Kconfig                             |   75 +
 fs/bcachefs/Makefile                            |   74 +
 fs/bcachefs/acl.c                               |  414 +++
 fs/bcachefs/acl.h                               |   58 +
 fs/bcachefs/alloc_background.c                  | 2212 +++++++++++++
 fs/bcachefs/alloc_background.h                  |  251 ++
 fs/bcachefs/alloc_foreground.c                  | 1534 +++++++++
 fs/bcachefs/alloc_foreground.h                  |  224 ++
 fs/bcachefs/alloc_types.h                       |  122 +
 fs/bcachefs/backpointers.c                      |  884 +++++
 fs/bcachefs/backpointers.h                      |  131 +
 fs/bcachefs/bbpos.h                             |   48 +
 fs/bcachefs/bcachefs.h                          | 1139 +++++++
 fs/bcachefs/bcachefs_format.h                   | 2219 +++++++++++++
 fs/bcachefs/bcachefs_ioctl.h                    |  368 +++
 fs/bcachefs/bkey.c                              | 1063 ++++++
 fs/bcachefs/bkey.h                              |  774 +++++
 fs/bcachefs/bkey_buf.h                          |   61 +
 fs/bcachefs/bkey_cmp.h                          |  129 +
 fs/bcachefs/bkey_methods.c                      |  520 +++
 fs/bcachefs/bkey_methods.h                      |  169 +
 fs/bcachefs/bkey_sort.c                         |  201 ++
 fs/bcachefs/bkey_sort.h                         |   44 +
 fs/bcachefs/bset.c                              | 1588 +++++++++
 fs/bcachefs/bset.h                              |  541 ++++
 fs/bcachefs/btree_cache.c                       | 1213 +++++++
 fs/bcachefs/btree_cache.h                       |  106 +
 fs/bcachefs/btree_gc.c                          | 2130 ++++++++++++
 fs/bcachefs/btree_gc.h                          |  112 +
 fs/bcachefs/btree_io.c                          | 2261 +++++++++++++
 fs/bcachefs/btree_io.h                          |  228 ++
 fs/bcachefs/btree_iter.c                        | 3214 ++++++++++++++++++
 fs/bcachefs/btree_iter.h                        |  916 ++++++
 fs/bcachefs/btree_key_cache.c                   | 1075 ++++++
 fs/bcachefs/btree_key_cache.h                   |   48 +
 fs/bcachefs/btree_locking.c                     |  804 +++++
 fs/bcachefs/btree_locking.h                     |  424 +++
 fs/bcachefs/btree_types.h                       |  731 +++++
 fs/bcachefs/btree_update.h                      |  358 ++
 fs/bcachefs/btree_update_interior.c             | 2476 ++++++++++++++
 fs/bcachefs/btree_update_interior.h             |  327 ++
 fs/bcachefs/btree_update_leaf.c                 | 2050 ++++++++++++
 fs/bcachefs/btree_write_buffer.c                |  343 ++
 fs/bcachefs/btree_write_buffer.h                |   14 +
 fs/bcachefs/btree_write_buffer_types.h          |   44 +
 fs/bcachefs/buckets.c                           | 2200 +++++++++++++
 fs/bcachefs/buckets.h                           |  370 +++
 fs/bcachefs/buckets_types.h                     |   92 +
 fs/bcachefs/buckets_waiting_for_journal.c       |  166 +
 fs/bcachefs/buckets_waiting_for_journal.h       |   15 +
 fs/bcachefs/buckets_waiting_for_journal_types.h |   23 +
 fs/bcachefs/chardev.c                           |  769 +++++
 fs/bcachefs/chardev.h                           |   31 +
 fs/bcachefs/checksum.c                          |  712 ++++
 fs/bcachefs/checksum.h                          |  215 ++
 fs/bcachefs/clock.c                             |  193 ++
 fs/bcachefs/clock.h                             |   38 +
 fs/bcachefs/clock_types.h                       |   37 +
 fs/bcachefs/compress.c                          |  638 ++++
 fs/bcachefs/compress.h                          |   18 +
 fs/bcachefs/counters.c                          |  107 +
 fs/bcachefs/counters.h                          |   17 +
 fs/bcachefs/darray.h                            |   87 +
 fs/bcachefs/data_update.c                       |  565 ++++
 fs/bcachefs/data_update.h                       |   43 +
 fs/bcachefs/debug.c                             |  957 ++++++
 fs/bcachefs/debug.h                             |   32 +
 fs/bcachefs/dirent.c                            |  564 ++++
 fs/bcachefs/dirent.h                            |   68 +
 fs/bcachefs/disk_groups.c                       |  548 ++++
 fs/bcachefs/disk_groups.h                       |  101 +
 fs/bcachefs/ec.c                                | 1957 +++++++++++
 fs/bcachefs/ec.h                                |  261 ++
 fs/bcachefs/ec_types.h                          |   41 +
 fs/bcachefs/errcode.c                           |   63 +
 fs/bcachefs/errcode.h                           |  243 ++
 fs/bcachefs/error.c                             |  297 ++
 fs/bcachefs/error.h                             |  213 ++
 fs/bcachefs/extent_update.c                     |  173 +
 fs/bcachefs/extent_update.h                     |   12 +
 fs/bcachefs/extents.c                           | 1384 ++++++++
 fs/bcachefs/extents.h                           |  755 +++++
 fs/bcachefs/extents_types.h                     |   40 +
 fs/bcachefs/eytzinger.h                         |  281 ++
 fs/bcachefs/fifo.h                              |  127 +
 fs/bcachefs/fs-common.c                         |  501 +++
 fs/bcachefs/fs-common.h                         |   43 +
 fs/bcachefs/fs-io.c                             | 3948 +++++++++++++++++++++++
 fs/bcachefs/fs-io.h                             |   54 +
 fs/bcachefs/fs-ioctl.c                          |  556 ++++
 fs/bcachefs/fs-ioctl.h                          |   81 +
 fs/bcachefs/fs.c                                | 1943 +++++++++++
 fs/bcachefs/fs.h                                |  206 ++
 fs/bcachefs/fsck.c                              | 2494 ++++++++++++++
 fs/bcachefs/fsck.h                              |    8 +
 fs/bcachefs/inode.c                             |  868 +++++
 fs/bcachefs/inode.h                             |  192 ++
 fs/bcachefs/io.c                                | 3056 ++++++++++++++++++
 fs/bcachefs/io.h                                |  202 ++
 fs/bcachefs/io_types.h                          |  165 +
 fs/bcachefs/journal.c                           | 1453 +++++++++
 fs/bcachefs/journal.h                           |  520 +++
 fs/bcachefs/journal_io.c                        | 1868 +++++++++++
 fs/bcachefs/journal_io.h                        |   64 +
 fs/bcachefs/journal_reclaim.c                   |  863 +++++
 fs/bcachefs/journal_reclaim.h                   |   86 +
 fs/bcachefs/journal_sb.c                        |  219 ++
 fs/bcachefs/journal_sb.h                        |   24 +
 fs/bcachefs/journal_seq_blacklist.c             |  322 ++
 fs/bcachefs/journal_seq_blacklist.h             |   22 +
 fs/bcachefs/journal_types.h                     |  358 ++
 fs/bcachefs/keylist.c                           |   52 +
 fs/bcachefs/keylist.h                           |   74 +
 fs/bcachefs/keylist_types.h                     |   16 +
 fs/bcachefs/lru.c                               |  178 +
 fs/bcachefs/lru.h                               |   63 +
 fs/bcachefs/migrate.c                           |  182 ++
 fs/bcachefs/migrate.h                           |    7 +
 fs/bcachefs/move.c                              | 1162 +++++++
 fs/bcachefs/move.h                              |   96 +
 fs/bcachefs/move_types.h                        |   36 +
 fs/bcachefs/movinggc.c                          |  420 +++
 fs/bcachefs/movinggc.h                          |   12 +
 fs/bcachefs/nocow_locking.c                     |  123 +
 fs/bcachefs/nocow_locking.h                     |   49 +
 fs/bcachefs/nocow_locking_types.h               |   20 +
 fs/bcachefs/opts.c                              |  555 ++++
 fs/bcachefs/opts.h                              |  543 ++++
 fs/bcachefs/printbuf.c                          |  415 +++
 fs/bcachefs/printbuf.h                          |  284 ++
 fs/bcachefs/quota.c                             |  980 ++++++
 fs/bcachefs/quota.h                             |   72 +
 fs/bcachefs/quota_types.h                       |   43 +
 fs/bcachefs/rebalance.c                         |  363 +++
 fs/bcachefs/rebalance.h                         |   28 +
 fs/bcachefs/rebalance_types.h                   |   26 +
 fs/bcachefs/recovery.c                          | 1648 ++++++++++
 fs/bcachefs/recovery.h                          |   58 +
 fs/bcachefs/reflink.c                           |  388 +++
 fs/bcachefs/reflink.h                           |   79 +
 fs/bcachefs/replicas.c                          | 1056 ++++++
 fs/bcachefs/replicas.h                          |   91 +
 fs/bcachefs/replicas_types.h                    |   27 +
 fs/bcachefs/seqmutex.h                          |   48 +
 fs/bcachefs/siphash.c                           |  173 +
 fs/bcachefs/siphash.h                           |   87 +
 fs/bcachefs/str_hash.h                          |  370 +++
 fs/bcachefs/subvolume.c                         | 1505 +++++++++
 fs/bcachefs/subvolume.h                         |  167 +
 fs/bcachefs/subvolume_types.h                   |   22 +
 fs/bcachefs/super-io.c                          | 1597 +++++++++
 fs/bcachefs/super-io.h                          |  126 +
 fs/bcachefs/super.c                             | 1993 ++++++++++++
 fs/bcachefs/super.h                             |  266 ++
 fs/bcachefs/super_types.h                       |   51 +
 fs/bcachefs/sysfs.c                             | 1064 ++++++
 fs/bcachefs/sysfs.h                             |   48 +
 fs/bcachefs/tests.c                             |  939 ++++++
 fs/bcachefs/tests.h                             |   15 +
 fs/bcachefs/trace.c                             |   16 +
 fs/bcachefs/trace.h                             | 1247 +++++++
 fs/bcachefs/two_state_shared_lock.c             |    8 +
 fs/bcachefs/two_state_shared_lock.h             |   59 +
 fs/bcachefs/util.c                              | 1137 +++++++
 fs/bcachefs/util.h                              |  842 +++++
 fs/bcachefs/varint.c                            |  121 +
 fs/bcachefs/varint.h                            |   11 +
 fs/bcachefs/vstructs.h                          |   63 +
 fs/bcachefs/xattr.c                             |  648 ++++
 fs/bcachefs/xattr.h                             |   51 +
 fs/dcache.c                                     |   12 +-
 fs/super.c                                      |   40 +-
 include/linux/bio.h                             |    7 +-
 include/linux/blkdev.h                          |    1 +
 {drivers/md/bcache => include/linux}/closure.h  |   46 +-
 include/linux/compiler_attributes.h             |    5 +
 include/linux/dcache.h                          |    1 +
 include/linux/exportfs.h                        |    6 +
 include/linux/fs.h                              |    1 +
 include/linux/generic-radix-tree.h              |   68 +-
 include/linux/lockdep.h                         |   10 +
 include/linux/lockdep_types.h                   |    2 +-
 include/linux/mean_and_variance.h               |  198 ++
 include/linux/sched.h                           |    1 +
 include/linux/six.h                             |  388 +++
 include/linux/string_helpers.h                  |    4 +-
 include/linux/uio.h                             |    2 +
 init/init_task.c                                |    1 +
 kernel/Kconfig.locks                            |    3 +
 kernel/locking/Makefile                         |    1 +
 kernel/locking/lockdep.c                        |   46 +
 kernel/locking/six.c                            |  893 +++++
 kernel/stacktrace.c                             |    2 +
 lib/Kconfig                                     |    3 +
 lib/Kconfig.debug                               |   18 +
 lib/Makefile                                    |    2 +
 {drivers/md/bcache => lib}/closure.c            |   36 +-
 lib/errname.c                                   |    1 +
 lib/generic-radix-tree.c                        |   76 +-
 lib/iov_iter.c                                  |   53 +-
 lib/math/Kconfig                                |    3 +
 lib/math/Makefile                               |    2 +
 lib/math/mean_and_variance.c                    |  158 +
 lib/math/mean_and_variance_test.c               |  239 ++
 lib/string_helpers.c                            |    8 +-
 217 files changed, 92440 insertions(+), 86 deletions(-)

Comments

Jens Axboe June 26, 2023, 11:11 p.m. UTC | #1
> (Worth noting the bug causing the most test failures by a wide margin is
> actually an io_uring bug that causes random umount failures in shutdown
> tests. Would be great to get that looked at; it doesn't just affect
> bcachefs).

Maybe if you had told someone about that it could get looked at?
What is the test case and what is going wrong?

>       block: Add some exports for bcachefs
>       block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
>       block: Bring back zero_fill_bio_iter
>       block: Don't block on s_umount from __invalidate_super()

OK...
Kent Overstreet June 27, 2023, 12:06 a.m. UTC | #2
On Mon, Jun 26, 2023 at 05:11:29PM -0600, Jens Axboe wrote:
> > (Worth noting the bug causing the most test failures by a wide margin is
> > actually an io_uring bug that causes random umount failures in shutdown
> > tests. Would be great to get that looked at; it doesn't just affect
> > bcachefs).
> 
> Maybe if you had told someone about that it could get looked at?

I'm more likely to report bugs to people who have a history of being
responsive...

> What is the test case and what is going wrong?

Example: https://evilpiepirate.org/~testdashboard/c/82973f03c0683f7ecebe14dfaa2c3c9989dd29fc/xfstests.generic.388/log.br

I haven't personally seen it on xfs - Darrick knew something about it
but he's on vacation. If I track down a reproducer on xfs I'll let you
know.

If you're wanting to dig into it on bcachefs, ktest is pretty easy to
get going: https://evilpiepirate.org/git/ktest.git

  $ ~/ktest/root_image create
  # from your kernel tree:
  $ ~/ktest/build-test-kernel run -ILP ~/ktest/tests/bcachefs/xfstests.ktest/generic/388

I have some debug code I can give you from when I was tracing it through
the mount path, I still have to find or recreate the part that tracked
it down to io_uring...
Jens Axboe June 27, 2023, 1:13 a.m. UTC | #3
On 6/26/23 6:06 PM, Kent Overstreet wrote:
> On Mon, Jun 26, 2023 at 05:11:29PM -0600, Jens Axboe wrote:
>>> (Worth noting the bug causing the most test failures by a wide margin is
>>> actually an io_uring bug that causes random umount failures in shutdown
>>> tests. Would be great to get that looked at; it doesn't just affect
>>> bcachefs).
>>
>> Maybe if you had told someone about that it could get looked at?
> 
> I'm more likely to report bugs to people who have a history of being
> responsive...

I maintain the code I put in the kernel, and generally respond to
everything, and most certainly bug reports.

>> What is the test case and what is going wrong?
> 
> Example: https://evilpiepirate.org/~testdashboard/c/82973f03c0683f7ecebe14dfaa2c3c9989dd29fc/xfstests.generic.388/log.br
> 
> I haven't personally seen it on xfs - Darrick knew something about it
> but he's on vacation. If I track down a reproducer on xfs I'll let you
> know.
>
> If you're wanting to dig into it on bcachefs, ktest is pretty easy to
> get going: https://evilpiepirate.org/git/ktest.git
> 
>   $ ~/ktest/root_image create
>   # from your kernel tree:
>   $ ~/ktest/build-test-kernel run -ILP ~/ktest/tests/bcachefs/xfstests.ktest/generic/388
> 
> I have some debug code I can give you from when I was tracing it through
> the mount path, I still have to find or recreate the part that tracked
> it down to io_uring...

Doesn't reproduce for me with XFS. The above ktest doesn't work for me
either:

~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388
realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory
Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting

and I suspect that should've been a space, but:

~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388
Running test xfstests.ktest on m1max at /home/axboe/git/linux-block
No tests found
TEST FAILED

If I just run generic/388 with bcachefs formatted drives, I get xfstests
complaining as it tries to mount an XFS file system...

As a side note, I do get these when compiling:

fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
 1526 | }
      | ^
fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
  388 | }
      | ^
Kent Overstreet June 27, 2023, 2:05 a.m. UTC | #4
On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
> either:

It just popped for me on xfs, but it took half an hour or so of looping
vs. 30 seconds on bcachefs.

> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388
> realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory
> Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting
> 
> and I suspect that should've been a space, but:
> 
> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388
> Running test xfstests.ktest on m1max at /home/axboe/git/linux-block
> No tests found
> TEST FAILED

doh, this is because we just changed it to pick up the list of tests
from the test lists that fstests generated.

Go into ktest/tests/xfstests and run make and it'll work. (Doesn't
matter if make fails due to missing libraries, it'll re-run make inside
the VM where the dependencies will all be available).
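
Putting that together with the earlier quickstart, the working sequence
is roughly this (a sketch; paths assume ktest is checked out at ~/ktest):

  $ cd ~/ktest/tests/xfstests && make   # failing on missing libraries is fine
  # from your kernel tree:
  $ ~/ktest/build-test-kernel run -ILP ~/ktest/tests/bcachefs/xfstests.ktest generic/388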

> As a side note, I do get these when compiling:
> 
> fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>  1526 | }
>       | ^
> fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>   388 | }

yeah, neither of those is super critical because they run at the top of
the stack, but they do need to be addressed. Might be time to start
heap-allocating btree_trans.
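
The shape of that change, sketched - the wrapper function and the GFP
choice here are illustrative, not the eventual bcachefs API:

#include <linux/slab.h>

/* A struct this large on the stack is what trips -Wframe-larger-than;
 * kmalloc'ing it shrinks the frame to one pointer plus an error check.
 * some_btree_op() is a hypothetical caller; struct btree_trans is
 * assumed to be in scope. */
int some_btree_op(void)
{
	struct btree_trans *trans;
	int ret;

	trans = kzalloc(sizeof(*trans), GFP_KERNEL);
	if (!trans)
		return -ENOMEM;

	/* ... transaction body unchanged, now via trans-> ... */
	ret = 0;

	kfree(trans);
	return ret;
}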
Kent Overstreet June 27, 2023, 2:33 a.m. UTC | #5
On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>  1526 | }
>       | ^
> fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>   388 | }
>       | ^

What version of gcc are you using? I'm not seeing either of those
warnings - I'm wondering if gcc recently got better about stack usage
when inlining.

also not seeing any reason why bch2_remap_range's stack frame should be
that big; to my eye it looks like it should be more like 1k, so if
anyone knows some magic for seeing stack frame layout, that would be
handy...

anyways, there's a patch in my testing branch that should fix
bch2_check_alloc_info
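
For the frame-size half of that question, the stock tooling is GCC's
-fstack-usage plus the kernel's scripts/checkstack.pl (neither shows the
actual frame layout, but they localize the offenders):

  $ make KCFLAGS=-fstack-usage fs/bcachefs/reflink.o
  $ cat fs/bcachefs/reflink.su          # per-function frame sizes
  $ objdump -d vmlinux | scripts/checkstack.pl arm64   # worst frames kernel-wide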
Jens Axboe June 27, 2023, 2:59 a.m. UTC | #6
On 6/26/23 8:05 PM, Kent Overstreet wrote:
> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
>> either:
> 
> It just popped for me on xfs, but it took half an hour or so of looping
> vs. 30 seconds on bcachefs.

OK, I'll try and leave it running overnight and see if I can get it to
trigger.

>> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388
>> realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory
>> Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting
>>
>> and I suspect that should've been a space, but:
>>
>> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388
>> Running test xfstests.ktest on m1max at /home/axboe/git/linux-block
>> No tests found
>> TEST FAILED
> 
> doh, this is because we just changed it to pick up the list of tests
> from the test lists that fstests generated.
> 
> Go into ktest/tests/xfstests and run make and it'll work. (Doesn't
> matter if make fails due to missing libraries, it'll re-run make inside
> the VM where the dependencies will all be available).

OK, I'll try that as well.

BTW, ran into these too. Didn't do anything, it was just a mount and
umount trying to get the test going:

axboe@m1max-kvm ~/g/k/t/xfstests> sudo cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff000201a5e000 (size 1024):
  comm "bch-copygc/nvme", pid 11362, jiffies 4295015821 (age 6863.776s)
  hex dump (first 32 bytes):
    40 00 00 00 00 00 00 00 62 aa e8 ee 00 00 00 00  @.......b.......
    10 e0 a5 01 02 00 ff ff 10 e0 a5 01 02 00 ff ff  ................
  backtrace:
    [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000041cfdde>] __kmalloc_node+0xac/0xd4
    [<00000000e1556d66>] kvmalloc_node+0x54/0xe4
    [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120
    [<000000005d44ce16>] rhashtable_init+0x148/0x1ac
    [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4
    [<00000000ea76e08f>] kthread+0xc4/0xd4
    [<0000000068107ad6>] ret_from_fork+0x10/0x20
unreferenced object 0xffff000200eed800 (size 1024):
  comm "bch-copygc/nvme", pid 13934, jiffies 4295086192 (age 6582.296s)
  hex dump (first 32 bytes):
    40 00 00 00 00 00 00 00 e8 a5 2a bb 00 00 00 00  @.........*.....
    10 d8 ee 00 02 00 ff ff 10 d8 ee 00 02 00 ff ff  ................
  backtrace:
    [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000041cfdde>] __kmalloc_node+0xac/0xd4
    [<00000000e1556d66>] kvmalloc_node+0x54/0xe4
    [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120
    [<000000005d44ce16>] rhashtable_init+0x148/0x1ac
    [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4
    [<00000000ea76e08f>] kthread+0xc4/0xd4
    [<0000000068107ad6>] ret_from_fork+0x10/0x20
Jens Axboe June 27, 2023, 2:59 a.m. UTC | #7
On 6/26/23 8:33 PM, Kent Overstreet wrote:
> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
>> fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
>> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>>  1526 | }
>>       | ^
>> fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
>> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>>   388 | }
>>       | ^
> 
> What version of gcc are you using? I'm not seeing either of those
> warnings - I'm wondering if gcc recently got better about stack usage
> when inlining.

Using:

gcc (Debian 13.1.0-6) 13.1.0

and it's on arm64, fwiw.
Kent Overstreet June 27, 2023, 3:10 a.m. UTC | #8
On Mon, Jun 26, 2023 at 08:59:13PM -0600, Jens Axboe wrote:
> On 6/26/23 8:05 PM, Kent Overstreet wrote:
> > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> >> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
> >> either:
> > 
> > It just popped for me on xfs, but it took half an hour or so of looping
> > vs. 30 seconds on bcachefs.
> 
> OK, I'll try and leave it running overnight and see if I can get it to
> trigger.
> 
> >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388
> >> realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory
> >> Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting
> >>
> >> and I suspect that should've been a space, but:
> >>
> >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388
> >> Running test xfstests.ktest on m1max at /home/axboe/git/linux-block
> >> No tests found
> >> TEST FAILED
> > 
> > doh, this is because we just changed it to pick up the list of tests
> > from the test lists that fstests generated.
> > 
> > Go into ktest/tests/xfstests and run make and it'll work. (Doesn't
> > matter if make fails due to missing libraries, it'll re-run make inside
> > the VM where the dependencies will all be available).
> 
> OK, I'll try that as well.
> 
> BTW, ran into these too. Didn't do anything, it was just a mount and
> umount trying to get the test going:
> 
> axboe@m1max-kvm ~/g/k/t/xfstests> sudo cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff000201a5e000 (size 1024):
>   comm "bch-copygc/nvme", pid 11362, jiffies 4295015821 (age 6863.776s)
>   hex dump (first 32 bytes):
>     40 00 00 00 00 00 00 00 62 aa e8 ee 00 00 00 00  @.......b.......
>     10 e0 a5 01 02 00 ff ff 10 e0 a5 01 02 00 ff ff  ................
>   backtrace:
>     [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000041cfdde>] __kmalloc_node+0xac/0xd4
>     [<00000000e1556d66>] kvmalloc_node+0x54/0xe4
>     [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120
>     [<000000005d44ce16>] rhashtable_init+0x148/0x1ac
>     [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4
>     [<00000000ea76e08f>] kthread+0xc4/0xd4
>     [<0000000068107ad6>] ret_from_fork+0x10/0x20
> unreferenced object 0xffff000200eed800 (size 1024):
>   comm "bch-copygc/nvme", pid 13934, jiffies 4295086192 (age 6582.296s)
>   hex dump (first 32 bytes):
>     40 00 00 00 00 00 00 00 e8 a5 2a bb 00 00 00 00  @.........*.....
>     10 d8 ee 00 02 00 ff ff 10 d8 ee 00 02 00 ff ff  ................
>   backtrace:
>     [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000041cfdde>] __kmalloc_node+0xac/0xd4
>     [<00000000e1556d66>] kvmalloc_node+0x54/0xe4
>     [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120
>     [<000000005d44ce16>] rhashtable_init+0x148/0x1ac
>     [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4
>     [<00000000ea76e08f>] kthread+0xc4/0xd4
>     [<0000000068107ad6>] ret_from_fork+0x10/0x20

yup, missing a rhashtable_destroy() call, I'll do some kmemleak testing
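
The pattern behind the leak, sketched - names loosely follow the
backtrace above, and the real copygc code and its rhashtable params
differ:

#include <linux/kthread.h>
#include <linux/rhashtable.h>

static const struct rhashtable_params sketch_params = {
	.head_offset = 0,	/* placeholder layout values */
	.key_offset  = 0,
	.key_len     = sizeof(u64),
};

static int copygc_thread_sketch(void *arg)
{
	struct rhashtable buckets_in_flight;
	int ret;

	/* rhashtable_init() kvmallocs the bucket table kmemleak flagged */
	ret = rhashtable_init(&buckets_in_flight, &sketch_params);
	if (ret)
		return ret;

	while (!kthread_should_stop()) {
		/* ... evacuate fragmented buckets ... */
	}

	rhashtable_destroy(&buckets_in_flight);	/* the missing teardown */
	return 0;
}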
Matthew Wilcox June 27, 2023, 3:19 a.m. UTC | #9
On Mon, Jun 26, 2023 at 08:59:44PM -0600, Jens Axboe wrote:
> On 6/26/23 8:33 PM, Kent Overstreet wrote:
> > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> >> fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
> >> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> >>  1526 | }
> >>       | ^
> >> fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
> >> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> >>   388 | }
> >>       | ^
> > 
> > What version of gcc are you using? I'm not seeing either of those
> > warnings - I'm wondering if gcc recently got better about stack usage
> > when inlining.
> 
> Using:
> 
> gcc (Debian 13.1.0-6) 13.1.0
> 
> and it's on arm64, fwiw.

OOI what PAGE_SIZE do you have configured?  Sometimes fs data structures
are PAGE_SIZE dependent (haven't looked at this particular bcachefs data
structure).  We've also had weirdness with various gcc versions on some
architectures making different inlining decisions from x86.
Kent Overstreet June 27, 2023, 3:22 a.m. UTC | #10
On Tue, Jun 27, 2023 at 04:19:33AM +0100, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 08:59:44PM -0600, Jens Axboe wrote:
> > On 6/26/23 8:33 PM, Kent Overstreet wrote:
> > > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> > >> fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’:
> > >> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > >>  1526 | }
> > >>       | ^
> > >> fs/bcachefs/reflink.c: In function ‘bch2_remap_range’:
> > >> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > >>   388 | }
> > >>       | ^
> > > 
> > > What version of gcc are you using? I'm not seeing either of those
> > > warnings - I'm wondering if gcc recently got better about stack usage
> > > when inlining.
> > 
> > Using:
> > 
> > gcc (Debian 13.1.0-6) 13.1.0
> > 
> > and it's on arm64, fwiw.
> 
> OOI what PAGE_SIZE do you have configured?  Sometimes fs data structures
> are PAGE_SIZE dependent (haven't looked at this particular bcachefs data
> structure).  We've also had weirdness with various gcc versions on some
> architectures making different inlining decisions from x86.

There are very few references to PAGE_SIZE in bcachefs; I've killed off
as much of that as I can, because all this code has to work in userspace
too and depending on PAGE_SIZE is sketchy there.
Christoph Hellwig June 27, 2023, 3:52 a.m. UTC | #11
Nacked-by: Christoph Hellwig <hch@lst.de>

Kent,

you really need to feed your prep patches through the maintainers
instead of just starting fights everywhere.  And you really need
someone other than yourself to vouch for the code, in the form of a
co-maintainer or reviewer.
Kent Overstreet June 27, 2023, 4:36 a.m. UTC | #12
On Mon, Jun 26, 2023 at 08:52:39PM -0700, Christoph Hellwig wrote:
> 
> Nacked-by: Christoph Hellwig <hch@lst.de>
> 
> Kent,
> 
> you really need to feed your prep patches through the maintainers
> instead of just starting fights everywhere.  And you really need
> someone other than yourself to vouch for the code, in the form of a
> co-maintainer or reviewer.

If there's a patch that you think is at issue, then I invite you to
point it out. Just please deliver your explanation with more technical
precision and less vitriol - thanks.
Jens Axboe June 27, 2023, 5:16 p.m. UTC | #13
On 6/26/23 8:59 PM, Jens Axboe wrote:
> On 6/26/23 8:05 PM, Kent Overstreet wrote:
>> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
>>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
>>> either:
>>
>> It just popped for me on xfs, but it took half an hour or so of looping
>> vs. 30 seconds on bcachefs.
> 
> OK, I'll try and leave it running overnight and see if I can get it to
> trigger.

I did manage to reproduce it, and also managed to get bcachefs to run
the test. But I had to add:

diff --git a/check b/check
index 5f9f1a6bec88..6d74bd4933bd 100755
--- a/check
+++ b/check
@@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
 	case "$1" in
 	-\? | -h | --help) usage ;;
 
-	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
+	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
 		FSTYP="${1:1}"
 		;;
 	-overlay)

to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
failing because it assumed it was XFS.

I suspected this was just a timing issue, and it looks like that's
exactly what it is. Looking at the test case, it'll randomly kill -9
fsstress, and if that happens while we have io_uring IO pending, then we
process completions inline (for a PF_EXITING current). This means they
get pushed to fallback work, which runs out of line. If we hit that case
AND the timing is such that it hasn't been processed yet, we'll still be
holding a file reference under the mount point and umount will -EBUSY
fail.

As far as I can tell, this can happen with aio as well, it's just harder
to hit. If the fput happens while the task is exiting, then fput will
end up being delayed through a workqueue as well. The test case assumes
that once it's reaped the exit of the killed task, all files are
released, which isn't necessarily true if they are done out-of-line.
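
In shell terms, the window being described looks roughly like this (a
sketch only; device, mountpoint, and fsstress flags are placeholders,
not the actual generic/388 logic):

#!/bin/bash
# kill -9 a task with io_uring IO in flight, then umount immediately
DEV=/dev/vdb
MNT=/mnt/scratch

for i in $(seq 100); do
	mount $DEV $MNT || exit 1
	mkdir -p $MNT/stress
	fsstress -d $MNT/stress -p 4 -n 100000 &
	pid=$!
	sleep $((RANDOM % 3 + 1))
	kill -9 $pid
	wait $pid 2>/dev/null
	# the task is reaped, but its final fput may still be queued on
	# out-of-line work, so the mount can still look busy:
	umount $MNT || { echo "umount failed (EBUSY?)"; exit 1; }
done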

For io_uring specifically, it may make sense to wait on the fallback
work. The below patch does this, and should fix the issue. But I'm not
fully convinced that this is really needed, as I do think this can
happen without io_uring as well. It just doesn't right now as the test
does buffered IO, and aio will be fully sync with buffered IO. That
means there's either no gap where aio will hit it without O_DIRECT, or
it's just small enough that it hasn't been hit.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..7abad5cb2131 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -150,7 +150,6 @@ static void io_clean_op(struct io_kiocb *req);
 static void io_queue_sqe(struct io_kiocb *req);
 static void io_move_task_work_from_local(struct io_ring_ctx *ctx);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static __cold void io_fallback_tw(struct io_uring_task *tctx);
 
 struct kmem_cache *req_cachep;
 
@@ -1248,6 +1247,49 @@ static inline struct llist_node *io_llist_cmpxchg(struct llist_head *head,
 	return cmpxchg(&head->first, old, new);
 }
 
+#define NR_FALLBACK_CTX	8
+
+static __cold void io_flush_fallback(struct io_ring_ctx **ctxs, int *nr_ctx)
+{
+	int i;
+
+	for (i = 0; i < *nr_ctx; i++) {
+		struct io_ring_ctx *ctx = ctxs[i];
+
+		flush_delayed_work(&ctx->fallback_work);
+		percpu_ref_put(&ctx->refs);
+	}
+	*nr_ctx = 0;
+}
+
+static __cold void io_flush_fallback_add(struct io_ring_ctx *ctx,
+					 struct io_ring_ctx **ctxs, int *nr_ctx)
+{
+	percpu_ref_get(&ctx->refs);
+	ctxs[(*nr_ctx)++] = ctx;
+	if (*nr_ctx == NR_FALLBACK_CTX)
+		io_flush_fallback(ctxs, nr_ctx);
+}
+
+static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
+{
+	struct llist_node *node = llist_del_all(&tctx->task_list);
+	struct io_ring_ctx *ctxs[NR_FALLBACK_CTX];
+	struct io_kiocb *req;
+	int nr_ctx = 0;
+
+	while (node) {
+		req = container_of(node, struct io_kiocb, io_task_work.node);
+		node = node->next;
+		if (sync)
+			io_flush_fallback_add(req->ctx, ctxs, &nr_ctx);
+		if (llist_add(&req->io_task_work.node,
+			      &req->ctx->fallback_llist))
+			schedule_delayed_work(&req->ctx->fallback_work, 1);
+	}
+	io_flush_fallback(ctxs, &nr_ctx);
+}
+
 void tctx_task_work(struct callback_head *cb)
 {
 	struct io_tw_state ts = {};
@@ -1260,7 +1302,7 @@ void tctx_task_work(struct callback_head *cb)
 	unsigned int count = 0;
 
 	if (unlikely(current->flags & PF_EXITING)) {
-		io_fallback_tw(tctx);
+		io_fallback_tw(tctx, true);
 		return;
 	}
 
@@ -1289,20 +1331,6 @@ void tctx_task_work(struct callback_head *cb)
 	trace_io_uring_task_work_run(tctx, count, loops);
 }
 
-static __cold void io_fallback_tw(struct io_uring_task *tctx)
-{
-	struct llist_node *node = llist_del_all(&tctx->task_list);
-	struct io_kiocb *req;
-
-	while (node) {
-		req = container_of(node, struct io_kiocb, io_task_work.node);
-		node = node->next;
-		if (llist_add(&req->io_task_work.node,
-			      &req->ctx->fallback_llist))
-			schedule_delayed_work(&req->ctx->fallback_work, 1);
-	}
-}
-
 static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -1377,7 +1405,7 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 	if (likely(!task_work_add(req->task, &tctx->task_work, ctx->notify_method)))
 		return;
 
-	io_fallback_tw(tctx);
+	io_fallback_tw(tctx, false);
 }
 
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
Kent Overstreet June 27, 2023, 8:15 p.m. UTC | #14
On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote:
> On 6/26/23 8:59 PM, Jens Axboe wrote:
> > On 6/26/23 8:05 PM, Kent Overstreet wrote:
> >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
> >>> either:
> >>
> >> It just popped for me on xfs, but it took half an hour or so of looping
> >> vs. 30 seconds on bcachefs.
> > 
> > OK, I'll try and leave it running overnight and see if I can get it to
> > trigger.
> 
> I did manage to reproduce it, and also managed to get bcachefs to run
> the test. But I had to add:
> 
> diff --git a/check b/check
> index 5f9f1a6bec88..6d74bd4933bd 100755
> --- a/check
> +++ b/check
> @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
>  	case "$1" in
>  	-\? | -h | --help) usage ;;
>  
> -	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
> +	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
>  		FSTYP="${1:1}"
>  		;;
>  	-overlay)

I wonder if this is due to an upstream fstests change I haven't seen
yet, I'll have a look.

> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
> failing because it assumed it was XFS.
> 
> I suspected this was just a timing issue, and it looks like that's
> exactly what it is. Looking at the test case, it'll randomly kill -9
> fsstress, and if that happens while we have io_uring IO pending, then we
> process completions inline (for a PF_EXITING current). This means they
> get pushed to fallback work, which runs out of line. If we hit that case
> AND the timing is such that it hasn't been processed yet, we'll still be
> holding a file reference under the mount point and umount will -EBUSY
> fail.
> 
> As far as I can tell, this can happen with aio as well, it's just harder
> to hit. If the fput happens while the task is exiting, then fput will
> end up being delayed through a workqueue as well. The test case assumes
> that once it's reaped the exit of the killed task that all files are
> released, which isn't necessarily true if they are done out-of-line.

Yeah, I traced it through to the delayed fput code as well.

I'm not sure delayed fput is responsible here; what I learned when I was
tracking this down has mostly fallen out of my brain, so take anything I
say with a large grain of salt. But I believe I tested with delayed_fput
completely disabled, and found another thing in io_uring with the same
effect as delayed_fput that wasn't being flushed.

> For io_uring specifically, it may make sense to wait on the fallback
> work. The below patch does this, and should fix the issue. But I'm not
> fully convinced that this is really needed, as I do think this can
> happen without io_uring as well. It just doesn't right now as the test
> does buffered IO, and aio will be fully sync with buffered IO. That
> means there's either no gap where aio will hit it without O_DIRECT, or
> it's just small enough that it hasn't been hit.

I just tried your patch and I still have generic/388 failing - it
might've taken a bit longer to pop this time.

I wonder if there might be a better way of solving this though? For aio,
when a process is exiting we just synchronously tear down the ioctx,
including waiting for outstanding iocbs.

delayed_fput, even though I believe it's not responsible here, seems sketchy
to me because there doesn't seem to be a straightforward way to flush
delayed fputs for a given _process_ - there's a single global work item,
and we can only flush globally.

Would what aio does work here?

(disclaimer: I haven't studied the io_uring code so I haven't figured
out the approach your patch is taking yet)
Dave Chinner June 27, 2023, 10:05 p.m. UTC | #15
On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote:
> On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote:
> > On 6/26/23 8:59 PM, Jens Axboe wrote:
> > > On 6/26/23 8:05 PM, Kent Overstreet wrote:
> > >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> > >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
> > >>> either:
> > >>
> > >> It just popped for me on xfs, but it took half an hour or so of looping
> > >> vs. 30 seconds on bcachefs.
> > > 
> > > OK, I'll try and leave it running overnight and see if I can get it to
> > > trigger.
> > 
> > I did manage to reproduce it, and also managed to get bcachefs to run
> > the test. But I had to add:
> > 
> > diff --git a/check b/check
> > index 5f9f1a6bec88..6d74bd4933bd 100755
> > --- a/check
> > +++ b/check
> > @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
> >  	case "$1" in
> >  	-\? | -h | --help) usage ;;
> >  
> > -	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
> > +	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
> >  		FSTYP="${1:1}"
> >  		;;
> >  	-overlay)
> 
> I wonder if this is due to an upstream fstests change I haven't seen
> yet, I'll have a look.

Run mkfs.bcachefs on the testdir first. fstests tries to probe the
filesystem type to test if $FSTYP is not set. If it doesn't find a
filesystem or it is unsupported, it will use the default (i.e. XFS).
There should be no reason to need to specify the filesystem type for
filesystems that blkid recognises. from common/config:

        # Autodetect fs type based on what's on $TEST_DEV unless it's been set
        # externally
        if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then
                FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV`
        fi
        FSTYP=${FSTYP:=xfs}
        export FSTYP

Cheers,

Dave.
Kent Overstreet June 27, 2023, 10:41 p.m. UTC | #16
On Wed, Jun 28, 2023 at 08:05:06AM +1000, Dave Chinner wrote:
> On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote:
> > On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote:
> > > On 6/26/23 8:59 PM, Jens Axboe wrote:
> > > > On 6/26/23 8:05 PM, Kent Overstreet wrote:
> > > >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
> > > >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
> > > >>> either:
> > > >>
> > > >> It just popped for me on xfs, but it took half an hour or so of looping
> > > >> vs. 30 seconds on bcachefs.
> > > > 
> > > > OK, I'll try and leave it running overnight and see if I can get it to
> > > > trigger.
> > > 
> > > I did manage to reproduce it, and also managed to get bcachefs to run
> > > the test. But I had to add:
> > > 
> > > diff --git a/check b/check
> > > index 5f9f1a6bec88..6d74bd4933bd 100755
> > > --- a/check
> > > +++ b/check
> > > @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
> > >  	case "$1" in
> > >  	-\? | -h | --help) usage ;;
> > >  
> > > -	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
> > > +	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
> > >  		FSTYP="${1:1}"
> > >  		;;
> > >  	-overlay)
> > 
> > I wonder if this is due to an upstream fstests change I haven't seen
> > yet, I'll have a look.
> 
> Run mkfs.bcachefs on the testdir first. fstests tries to probe the
> filesystem type to test if $FSTYP is not set. If it doesn't find a
> filesystem or it is unsupported, it will use the default (i.e. XFS).
> There should be no reason to need to specify the filesystem type for
> filesystems that blkid recognises. from common/config:

Actually ktest already does that, and it sets $FSTYP as well. Jens, are
you sure you weren't doing something funny?
Jens Axboe June 28, 2023, 3:16 a.m. UTC | #17
On 6/27/23 2:15 PM, Kent Overstreet wrote:
>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>> failing because it assumed it was XFS.
>>
>> I suspected this was just a timing issue, and it looks like that's
>> exactly what it is. Looking at the test case, it'll randomly kill -9
>> fsstress, and if that happens while we have io_uring IO pending, then we
>> process completions inline (for a PF_EXITING current). This means they
>> get pushed to fallback work, which runs out of line. If we hit that case
>> AND the timing is such that it hasn't been processed yet, we'll still be
>> holding a file reference under the mount point and umount will -EBUSY
>> fail.
>>
>> As far as I can tell, this can happen with aio as well, it's just harder
>> to hit. If the fput happens while the task is exiting, then fput will
>> end up being delayed through a workqueue as well. The test case assumes
> that once it's reaped the exit of the killed task, all files are
>> released, which isn't necessarily true if they are done out-of-line.
> 
> Yeah, I traced it through to the delayed fput code as well.
> 
> I'm not sure delayed fput is responsible here; what I learned when I was
> tracking this down has mostly fell out of my brain, so take anything I
> say with a large grain of salt. But I believe I tested with delayed_fput
> completely disabled, and found another thing in io_uring with the same
> effect as delayed_fput that wasn't being flushed.

I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
io_uring can end up doing. But yes, delayed_fput() is another candidate.

>> For io_uring specifically, it may make sense to wait on the fallback
>> work. The below patch does this, and should fix the issue. But I'm not
>> fully convinced that this is really needed, as I do think this can
>> happen without io_uring as well. It just doesn't right now as the test
>> does buffered IO, and aio will be fully sync with buffered IO. That
>> means there's either no gap where aio will hit it without O_DIRECT, or
>> it's just small enough that it hasn't been hit.
> 
> I just tried your patch and I still have generic/388 failing - it
> might've taken a bit longer to pop this time.

Yep see the same here. Didn't have time to look into it after sending
that email today, just took a quick stab at writing a reproducer and
ended up crashing bcachefs:

[ 1122.384909] workqueue: Failed to create a rescuer kthread for wq "bcachefs": -EINTR
[ 1122.384915] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 1122.385814] Mem abort info:
[ 1122.385962]   ESR = 0x0000000096000004
[ 1122.386161]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1122.386444]   SET = 0, FnV = 0
[ 1122.386612]   EA = 0, S1PTW = 0
[ 1122.386842]   FSC = 0x04: level 0 translation fault
[ 1122.387168] Data abort info:
[ 1122.387321]   ISV = 0, ISS = 0x00000004
[ 1122.387518]   CM = 0, WnR = 0
[ 1122.387676] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001133da000
[ 1122.388014] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 1122.388363] Internal error: Oops: 0000000096000004 [#1] SMP
[ 1122.388659] Modules linked in:
[ 1122.388866] CPU: 4 PID: 23129 Comm: mount Not tainted 6.4.0-02556-ge61c7fc22b68-dirty #3647
[ 1122.389389] Hardware name: linux,dummy-virt (DT)
[ 1122.389682] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 1122.390118] pc : bch2_free_pending_node_rewrites+0x40/0x90
[ 1122.390466] lr : bch2_free_pending_node_rewrites+0x28/0x90
[ 1122.390815] sp : ffff80002481b770
[ 1122.391030] x29: ffff80002481b770 x28: ffff0000e1d24000 x27: 00000000fffff7b7
[ 1122.391475] x26: 0000000000000000 x25: ffff0000e1d00040 x24: dead000000000122
[ 1122.391919] x23: dead000000000100 x22: ffff0000e1d031b8 x21: ffff0000e1d00040
[ 1122.392366] x20: 0000000000000000 x19: ffff0000e1d031a8 x18: 0000000000000009
[ 1122.392860] x17: 3a22736665686361 x16: 6362222071772072 x15: 6f66206461657268
[ 1122.393622] x14: 746b207265756373 x13: 52544e49452d203a x12: 0000000000000001
[ 1122.395170] x11: 0000000000000001 x10: 0000000000000000 x9 : 00000000000002d3
[ 1122.396592] x8 : 00000000000003f8 x7 : 0000000000000000 x6 : ffff8000093c2e78
[ 1122.397970] x5 : ffff000209de4240 x4 : ffff0000e1d00030 x3 : dead000000000122
[ 1122.399263] x2 : 00000000000031a8 x1 : 0000000000000000 x0 : 0000000000000000
[ 1122.400473] Call trace:
[ 1122.400908]  bch2_free_pending_node_rewrites+0x40/0x90
[ 1122.401783]  bch2_fs_release+0x48/0x24c
[ 1122.402589]  kobject_put+0x7c/0xe8
[ 1122.403271]  bch2_fs_free+0xa4/0xc8
[ 1122.404033]  bch2_fs_alloc+0x5c8/0xbcc
[ 1122.404888]  bch2_fs_open+0x19c/0x430
[ 1122.405781]  bch2_mount+0x194/0x45c
[ 1122.406643]  legacy_get_tree+0x2c/0x54
[ 1122.407476]  vfs_get_tree+0x28/0xd4
[ 1122.408251]  path_mount+0x5d0/0x6c8
[ 1122.409056]  do_mount+0x80/0xa4
[ 1122.409866]  __arm64_sys_mount+0x150/0x168
[ 1122.410904]  invoke_syscall.constprop.0+0x70/0xb8
[ 1122.411890]  do_el0_svc+0xbc/0xf0
[ 1122.412596]  el0_svc+0x74/0x9c
[ 1122.413343]  el0t_64_sync_handler+0xa8/0x134
[ 1122.414148]  el0t_64_sync+0x168/0x16c
[ 1122.414863] Code: f2fbd5b7 d2863502 91008af8 8b020273 (f85d8695) 
[ 1122.415939] ---[ end trace 0000000000000000 ]---

> I wonder if there might be a better way of solving this though? For aio,
> when a process is exiting we just synchronously tear down the ioctx,
> including waiting for outstanding iocbs.

aio is pretty trivial, because the only async it supports is O_DIRECT
on regular files which always completes in finite time. io_uring has to
cancel etc, so we need to do a lot more.

But the concept of my patch should be fine, but I think we must be
missing a case. Which is why I started writing a small reproducer
instead. I'll pick it up again tomorrow and see what is going on here.

> delayed_fput, even though I believe it's not responsible here, seems sketchy
> to me because there doesn't seem to be a straightforward way to flush
> delayed fputs for a given _process_ - there's a single global work item,
> and we can only flush globally.

Yep as mentioned I don't think it's delayed_fput at all. And yeah we can
only globally flush that, not per task/files_struct.
Kent Overstreet June 28, 2023, 4:01 a.m. UTC | #18
On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
> On 6/27/23 2:15 PM, Kent Overstreet wrote:
> >> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
> >> failing because it assumed it was XFS.
> >>
> >> I suspected this was just a timing issue, and it looks like that's
> >> exactly what it is. Looking at the test case, it'll randomly kill -9
> >> fsstress, and if that happens while we have io_uring IO pending, then we
> >> process completions inline (for a PF_EXITING current). This means they
> >> get pushed to fallback work, which runs out of line. If we hit that case
> >> AND the timing is such that it hasn't been processed yet, we'll still be
> >> holding a file reference under the mount point and umount will -EBUSY
> >> fail.
> >>
> >> As far as I can tell, this can happen with aio as well, it's just harder
> >> to hit. If the fput happens while the task is exiting, then fput will
> >> end up being delayed through a workqueue as well. The test case assumes
> > that once it's reaped the exit of the killed task, all files are
> >> released, which isn't necessarily true if they are done out-of-line.
> > 
> > Yeah, I traced it through to the delayed fput code as well.
> > 
> > I'm not sure delayed fput is responsible here; what I learned when I was
> > tracking this down has mostly fell out of my brain, so take anything I
> > say with a large grain of salt. But I believe I tested with delayed_fput
> > completely disabled, and found another thing in io_uring with the same
> > effect as delayed_fput that wasn't being flushed.
> 
> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
> io_uring can end up doing. But yes, delayed_fput() is another candidate.

Sorry - was just working through my recollections/initial thought
process out loud

> >> For io_uring specifically, it may make sense to wait on the fallback
> >> work. The below patch does this, and should fix the issue. But I'm not
> >> fully convinced that this is really needed, as I do think this can
> >> happen without io_uring as well. It just doesn't right now as the test
> >> does buffered IO, and aio will be fully sync with buffered IO. That
> >> means there's either no gap where aio will hit it without O_DIRECT, or
> >> it's just small enough that it hasn't been hit.
> > 
> > I just tried your patch and I still have generic/388 failing - it
> > might've taken a bit longer to pop this time.
> 
> Yep see the same here. Didn't have time to look into it after sending
> that email today, just took a quick stab at writing a reproducer and
> ended up crashing bcachefs:

You must have hit an error before we finished initializing the
filesystem; the list head never got initialized. A patch for that will
be in the testing branch momentarily.
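
(Generic illustration of the bug class, with made-up names - not the
actual bcachefs code. An early failure jumps to the teardown path, which
walks a list head that was never initialized; the fix is to initialize
it before the first failure point:)

        struct ctx {
                struct list_head items;
        };

        static struct ctx *ctx_alloc(void)
        {
                struct ctx *c = kzalloc(sizeof(*c), GFP_KERNEL);

                if (!c)
                        return NULL;
                /* Set up list heads (and locks) immediately, so every
                 * subsequent "goto err" can safely run a teardown path
                 * that walks c->items. */
                INIT_LIST_HEAD(&c->items);
                return c;
        }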

> > I wonder if there might be a better way of solving this though? For aio,
> > when a process is exiting we just synchronously tear down the ioctx,
> > including waiting for outstanding iocbs.
> 
> aio is pretty trivial, because the only async it supports is O_DIRECT
> on regular files which always completes in finite time. io_uring has to
> cancel etc, so we need to do a lot more.

Ahh yes, buffered IO would complicate things.

> But the concept of my patch should be fine, but I think we must be
> missing a case. Which is why I started writing a small reproducer
> instead. I'll pick it up again tomorrow and see what is going on here.

OK. As soon as you've got a patch I'll throw it at my CI, or I can point
my CI at your branch if you have one.
Jens Axboe June 28, 2023, 2:40 p.m. UTC | #19
On 6/27/23 4:05 PM, Dave Chinner wrote:
> On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote:
>> On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote:
>>> On 6/26/23 8:59 PM, Jens Axboe wrote:
>>>> On 6/26/23 8:05 PM, Kent Overstreet wrote:
>>>>> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
>>>>>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
>>>>>> either:
>>>>>
>>>>> It just popped for me on xfs, but it took half an hour or so of looping
>>>>> vs. 30 seconds on bcachefs.
>>>>
>>>> OK, I'll try and leave it running overnight and see if I can get it to
>>>> trigger.
>>>
>>> I did manage to reproduce it, and also managed to get bcachefs to run
>>> the test. But I had to add:
>>>
>>> diff --git a/check b/check
>>> index 5f9f1a6bec88..6d74bd4933bd 100755
>>> --- a/check
>>> +++ b/check
>>> @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
>>>  	case "$1" in
>>>  	-\? | -h | --help) usage ;;
>>>  
>>> -	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
>>> +	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
>>>  		FSTYP="${1:1}"
>>>  		;;
>>>  	-overlay)
>>
>> I wonder if this is due to an upstream fstests change I haven't seen
>> yet, I'll have a look.
> 
> Run mkfs.bcachefs on the testdir first. fstests tries to probe the
> filesystem type to test if $FSTYP is not set. If it doesn't find a
> filesystem or it is unsupported, it will use the default (i.e. XFS).

I did format both the test and scratch devices with bcachefs first, so
I'm guessing something is going wrong with figuring out what filesystem
is on the device, and it then defaults to XFS. I didn't spend too much
time on that bit; I figured it was easier to just force bcachefs for my
purpose.

> There should be no reason to need to specify the filesystem type for
> filesystems that blkid recognises. from common/config:
> 
>         # Autodetect fs type based on what's on $TEST_DEV unless it's been set
>         # externally
>         if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then
>                 FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV`
>         fi
>         FSTYP=${FSTYP:=xfs}
>         export FSTYP

Gotcha, yep it's because blkid fails to figure it out.
Thomas Weißschuh June 28, 2023, 2:48 p.m. UTC | #20
On 2023-06-28 08:40:27-0600, Jens Axboe wrote:
> On 6/27/23 4:05 PM, Dave Chinner wrote:
> [..]

> > There should be no reason to need to specify the filesystem type for
> > filesystems that blkid recognises. from common/config:
> > 
> >         # Autodetect fs type based on what's on $TEST_DEV unless it's been set
> >         # externally
> >         if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then
> >                 FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV`
> >         fi
> >         FSTYP=${FSTYP:=xfs}
> >         export FSTYP
> 
> Gotcha, yep it's because blkid fails to figure it out.

This needs blkid/util-linux version 2.39, which is fairly new.
If it doesn't work with that, it's a bug.

Thomas
Jens Axboe June 28, 2023, 2:58 p.m. UTC | #21
On 6/27/23 10:01 PM, Kent Overstreet wrote:
> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
>> On 6/27/23 2:15 PM, Kent Overstreet wrote:
>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>>>> failing because it assumed it was XFS.
>>>>
>>>> I suspected this was just a timing issue, and it looks like that's
>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
>>>> fsstress, and if that happens while we have io_uring IO pending, then we
>>>> process completions inline (for a PF_EXITING current). This means they
>>>> get pushed to fallback work, which runs out of line. If we hit that case
>>>> AND the timing is such that it hasn't been processed yet, we'll still be
>>>> holding a file reference under the mount point and umount will -EBUSY
>>>> fail.
>>>>
>>>> As far as I can tell, this can happen with aio as well, it's just harder
>>>> to hit. If the fput happens while the task is exiting, then fput will
>>>> end up being delayed through a workqueue as well. The test case assumes
>>>> that once it's reaped the exit of the killed task that all files are
>>>> released, which isn't necessarily true if they are done out-of-line.
>>>
>>> Yeah, I traced it through to the delayed fput code as well.
>>>
>>> I'm not sure delayed fput is responsible here; what I learned when I was
>>> tracking this down has mostly fell out of my brain, so take anything I
>>> say with a large grain of salt. But I believe I tested with delayed_fput
>>> completely disabled, and found another thing in io_uring with the same
>>> effect as delayed_fput that wasn't being flushed.
>>
>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
> 
> Sorry - was just working through my recollections/initial thought
> process out loud

No worries, it might actually be a combination, and this is why my
io_uring side patch didn't fully resolve it. I wrote a simple reproducer
and it seems to reliably trigger it, but it is fixed with a flush of the
delayed fput list on mount -EBUSY return. Still digging...
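
Roughly this kind of hack, for reference (a hypothetical sketch, not the
actual patch; retry_busy_umount() is a made-up name):

        /* If the umount comes back -EBUSY, drain the global
         * delayed-fput list once and retry before giving up. */
        static int retry_busy_umount(struct mount *mnt, int flags)
        {
                int ret = do_umount(mnt, flags);

                if (ret == -EBUSY) {
                        flush_delayed_fput();
                        ret = do_umount(mnt, flags);
                }
                return ret;
        }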

>>>> For io_uring specifically, it may make sense to wait on the fallback
>>>> work. The below patch does this, and should fix the issue. But I'm not
>>>> fully convinced that this is really needed, as I do think this can
>>>> happen without io_uring as well. It just doesn't right now as the test
>>>> does buffered IO, and aio will be fully sync with buffered IO. That
>>>> means there's either no gap where aio will hit it without O_DIRECT, or
>>>> it's just small enough that it hasn't been hit.
>>>
>>> I just tried your patch and I still have generic/388 failing - it
>>> might've taken a bit longer to pop this time.
>>
>> Yep see the same here. Didn't have time to look into it after sending
>> that email today, just took a quick stab at writing a reproducer and
>> ended up crashing bcachefs:
> 
> You must have hit an error before we finished initializing the
> filesystem, the list head never got initialized. Patch for that will be
> in the testing branch momentarily.

I'll pull that in. In testing just now, I hit a few more leaks:

unreferenced object 0xffff0000e55cf200 (size 128):
  comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cde48528>] __kmalloc+0xac/0xd4
    [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20
    [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc
    [<000000003b8339fd>] bch2_fs_open+0x19c/0x430
    [<00000000aef40a23>] bch2_mount+0x194/0x45c
    [<0000000005e49357>] legacy_get_tree+0x2c/0x54
    [<00000000f5813622>] vfs_get_tree+0x28/0xd4
    [<00000000ea6972ec>] path_mount+0x5d0/0x6c8
    [<00000000468ec307>] do_mount+0x80/0xa4
    [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168
    [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8
    [<000000008f20c487>] do_el0_svc+0xbc/0xf0
    [<00000000a1018c2c>] el0_svc+0x74/0x9c
    [<00000000fc46d579>] el0t_64_sync_handler+0xa8/0x134
unreferenced object 0xffff0000e55cf580 (size 128):
  comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cde48528>] __kmalloc+0xac/0xd4
    [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60
    [<000000008ff20762>] register_shrinker+0x14/0x34
    [<000000007fa7e36c>] bch2_fs_btree_cache_init+0xf8/0x150
    [<000000005135a635>] bch2_fs_alloc+0x7ac/0xbcc
    [<000000003b8339fd>] bch2_fs_open+0x19c/0x430
    [<00000000aef40a23>] bch2_mount+0x194/0x45c
    [<0000000005e49357>] legacy_get_tree+0x2c/0x54
    [<00000000f5813622>] vfs_get_tree+0x28/0xd4
    [<00000000ea6972ec>] path_mount+0x5d0/0x6c8
    [<00000000468ec307>] do_mount+0x80/0xa4
    [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168
    [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8
    [<000000008f20c487>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff0000e55cf480 (size 128):
  comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cde48528>] __kmalloc+0xac/0xd4
    [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60
    [<000000008ff20762>] register_shrinker+0x14/0x34
    [<000000003d050c32>] bch2_fs_btree_key_cache_init+0x88/0x90
    [<00000000d9f351c0>] bch2_fs_alloc+0x7c0/0xbcc
    [<000000003b8339fd>] bch2_fs_open+0x19c/0x430
    [<00000000aef40a23>] bch2_mount+0x194/0x45c
    [<0000000005e49357>] legacy_get_tree+0x2c/0x54
    [<00000000f5813622>] vfs_get_tree+0x28/0xd4
    [<00000000ea6972ec>] path_mount+0x5d0/0x6c8
    [<00000000468ec307>] do_mount+0x80/0xa4
    [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168
    [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8
    [<000000008f20c487>] do_el0_svc+0xbc/0xf0

>>> I wonder if there might be a better way of solving this though? For aio,
>>> when a process is exiting we just synchronously tear down the ioctx,
>>> including waiting for outstanding iocbs.
>>
>> aio is pretty trivial, because the only async it supports is O_DIRECT
>> on regular files which always completes in finite time. io_uring has to
>> cancel etc, so we need to do a lot more.
> 
> ahh yes, buffered IO would complicate things
> 
>> But the concept of my patch should be fine, but I think we must be
>> missing a case. Which is why I started writing a small reproducer
>> instead. I'll pick it up again tomorrow and see what is going on here.
> 
> Ok. Soon as you've got a patch I'll throw it at my CI, or I can point my
> CI at your branch if you have one.

I should have something later today; I don't feel like I fully
understand all of it just yet.
Jens Axboe June 28, 2023, 2:58 p.m. UTC | #22
On 6/28/23 8:48 AM, Thomas Weißschuh wrote:
> On 2023-06-28 08:40:27-0600, Jens Axboe wrote:
>> On 6/27/23 4:05 PM, Dave Chinner wrote:
>> [..]
> 
>>> There should be no reason to need to specify the filesystem type for
>>> filesystems that blkid recognises. from common/config:
>>>
>>>         # Autodetect fs type based on what's on $TEST_DEV unless it's been set
>>>         # externally
>>>         if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then
>>>                 FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV`
>>>         fi
>>>         FSTYP=${FSTYP:=xfs}
>>>         export FSTYP
>>
>> Gotcha, yep it's because blkid fails to figure it out.
> 
> This needs blkid/util-linux version 2.39 which is fairly new.
> If it doesn't work with that, it's a bug.

Got it, looks like I have 2.38.1 here.
Jens Axboe June 28, 2023, 3:22 p.m. UTC | #23
On 6/28/23 8:58 AM, Jens Axboe wrote:
> I should have something later today, don't feel like I fully understand
> all of it just yet.

It might indeed be delayed_fput; it's just that the flush is a bit
broken, in that it races with the worker doing the flush. In any case,
while testing that, I hit this before I got an umount failure on loop 6
of generic/388:


External UUID:                              724c7f1e-fed4-46e8-888a-2d5b170365b7
Internal UUID:                              4c356134-e573-4aa4-a7b6-c22ab260e0ff
Device index:                               0
Label:                                      
Version:                                    snapshot_trees
Oldest version on disk:                     snapshot_trees
Created:                                    Wed Jun 28 09:16:47 2023
Sequence number:                            0
Superblock size:                            816
Clean:                                      0
Devices:                                    1
Sections:                                   members
Features:                                   new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                            

Options:
  block_size:                               512 B
  btree_node_size:                          256 KiB
  errors:                                   continue [ro] panic 
  metadata_replicas:                        1
  data_replicas:                            1
  metadata_replicas_required:               1
  data_replicas_required:                   1
  encoded_extent_max:                       64.0 KiB
  metadata_checksum:                        none [crc32c] crc64 xxhash 
  data_checksum:                            none [crc32c] crc64 xxhash 
  compression:                              [none] lz4 gzip zstd 
  background_compression:                   [none] lz4 gzip zstd 
  str_hash:                                 crc32c crc64 [siphash] 
  metadata_target:                          none
  foreground_target:                        none
  background_target:                        none
  promote_target:                           none
  erasure_code:                             0
  inodes_32bit:                             1
  shard_inode_numbers:                      1
  inodes_use_key_cache:                     1
  gc_reserve_percent:                       8
  gc_reserve_bytes:                         0 B
  root_reserve_percent:                     0
  wide_macs:                                0
  acl:                                      1
  usrquota:                                 0
  grpquota:                                 0
  prjquota:                                 0
  journal_flush_delay:                      1000
  journal_flush_disabled:                   0
  journal_reclaim_delay:                    100
  journal_transaction_names:                1
  nocow:                                    0

members (size 64):
  Device:                                   0
    UUID:                                   dea79b51-ed22-4f11-9cb9-2117240419df
    Size:                                   20.0 GiB
    Bucket size:                            256 KiB
    First bucket:                           0
    Buckets:                                81920
    Last mount:                             (never)
    State:                                  rw
    Label:                                  (none)
    Data allowed:                           journal,btree,user
    Has data:                               (none)
    Discard:                                0
    Freespace initialized:                  0
initializing new filesystem
going read-write
initializing freespace
mounted version=snapshot_trees
seed = 1687442369
seed = 1687347478
seed = 1687934778
seed = 1687706987
seed = 1687173946
seed = 1687488122
seed = 1687828133
seed = 1687316163
seed = 1687511704
seed = 1687772088
seed = 1688057713
seed = 1687321139
seed = 1687166901
seed = 1687602318
seed = 1687659981
seed = 1687457702
seed = 1688000542
seed = 1687221947
seed = 1687740111
seed = 1688083754
seed = 1687314115
seed = 1687189436
seed = 1687664679
seed = 1687631074
seed = 1687691080
seed = 1688089920
seed = 1687962494
seed = 1687646206
seed = 1687636790
seed = 1687442248
seed = 1687532669
seed = 1687436103
seed = 1687626640
seed = 1687594091
seed = 1687235023
seed = 1687525509
seed = 1687766818
seed = 1688040782
seed = 1687293628
seed = 1687468804
seed = 1688129968
seed = 1687176698
seed = 1687603782
seed = 1687642709
seed = 1687844382
seed = 1687696290
seed = 1688169221
_check_generic_filesystem: filesystem on /dev/nvme0n1 is inconsistent
*** fsck.bcachefs output ***
fsck from util-linux 2.38.1
recovering from clean shutdown, journal seq 14642
journal read done, replaying entries 14642-14642
checking allocations
starting journal replay, 0 keys
checking need_discard and freespace btrees
checking lrus
checking backpointers to alloc keys
checking backpointers to extents
backpointer for missing extent
  u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing
checking extents to backpointers
checking alloc to lru refs
starting fsck
going read-write
mounted version=snapshot_trees opts=degraded,fsck,fix_errors,nochanges
0xaaaafeb6b580g: still has errors
*** end fsck.bcachefs output
*** mount output ***
/dev/vda2 on / type ext4 (rw,relatime,errors=remount-ro)
devtmpfs on /dev type devtmpfs (rw,relatime,size=8174296k,nr_inodes=2043574,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=3276876k,nr_inodes=819200,mode=755)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/vda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
*** end mount output
Jens Axboe June 28, 2023, 4:57 p.m. UTC | #24
On 6/28/23 8:58 AM, Jens Axboe wrote:
> On 6/27/23 10:01 PM, Kent Overstreet wrote:
>> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
>>> On 6/27/23 2:15 PM, Kent Overstreet wrote:
>>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>>>>> failing because it assumed it was XFS.
>>>>>
>>>>> I suspected this was just a timing issue, and it looks like that's
>>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
>>>>> fsstress, and if that happens while we have io_uring IO pending, then we
>>>>> process completions inline (for a PF_EXITING current). This means they
>>>>> get pushed to fallback work, which runs out of line. If we hit that case
>>>>> AND the timing is such that it hasn't been processed yet, we'll still be
>>>>> holding a file reference under the mount point and umount will -EBUSY
>>>>> fail.
>>>>>
>>>>> As far as I can tell, this can happen with aio as well, it's just harder
>>>>> to hit. If the fput happens while the task is exiting, then fput will
>>>>> end up being delayed through a workqueue as well. The test case assumes
>>>>> that once it's reaped the exit of the killed task that all files are
>>>>> released, which isn't necessarily true if they are done out-of-line.
>>>>
>>>> Yeah, I traced it through to the delayed fput code as well.
>>>>
>>>> I'm not sure delayed fput is responsible here; what I learned when I was
>>>> tracking this down has mostly fell out of my brain, so take anything I
>>>> say with a large grain of salt. But I believe I tested with delayed_fput
>>>> completely disabled, and found another thing in io_uring with the same
>>>> effect as delayed_fput that wasn't being flushed.
>>>
>>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
>>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
>>
>> Sorry - was just working through my recollections/initial thought
>> process out loud
> 
> No worries, it might actually be a combination and this is why my
> io_uring side patch didn't fully resolve it. Wrote a simple reproducer
> and it seems to reliably trigger it, but is fixed with an flush of the
> delayed fput list on mount -EBUSY return. Still digging...

I discussed this with Christian offline. I have a patch that is pretty
simple, but it does mean that you'd wait for the delayed fput flush off
umount, which seems kind of iffy.

I think we need to back up a bit and consider if the kill && umount
really is sane. If you kill a task that has open files, then any fput
from that task will end up being delayed. This means that the umount may
very well fail.

It'd be handy if we could have umount wait for that to finish, but I'm
not at all confident this is a sane solution for all cases. And as
discussed, we have no way to even identify which files we'd need to
flush out of the delayed list.

Maybe the test case just needs fixing? Christian suggested a
lazy/detached umount plus waiting for sb release. There's an fsnotify
hook for that, fsnotify_sb_delete(). Obviously this is a bit more
involved, but it seems to me that this would be the way to make it more
reliable when killing tasks with open files is involved.
Christian Brauner June 28, 2023, 5:33 p.m. UTC | #25
On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
> On 6/28/23 8:58 AM, Jens Axboe wrote:
> > On 6/27/23 10:01 PM, Kent Overstreet wrote:
> >> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
> >>> On 6/27/23 2:15 PM, Kent Overstreet wrote:
> >>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
> >>>>> failing because it assumed it was XFS.
> >>>>>
> >>>>> I suspected this was just a timing issue, and it looks like that's
> >>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
> >>>>> fsstress, and if that happens while we have io_uring IO pending, then we
> >>>>> process completions inline (for a PF_EXITING current). This means they
> >>>>> get pushed to fallback work, which runs out of line. If we hit that case
> >>>>> AND the timing is such that it hasn't been processed yet, we'll still be
> >>>>> holding a file reference under the mount point and umount will -EBUSY
> >>>>> fail.
> >>>>>
> >>>>> As far as I can tell, this can happen with aio as well, it's just harder
> >>>>> to hit. If the fput happens while the task is exiting, then fput will
> >>>>> end up being delayed through a workqueue as well. The test case assumes
> >>>>> that once it's reaped the exit of the killed task that all files are
> >>>>> released, which isn't necessarily true if they are done out-of-line.
> >>>>
> >>>> Yeah, I traced it through to the delayed fput code as well.
> >>>>
> >>>> I'm not sure delayed fput is responsible here; what I learned when I was
> >>>> tracking this down has mostly fell out of my brain, so take anything I
> >>>> say with a large grain of salt. But I believe I tested with delayed_fput
> >>>> completely disabled, and found another thing in io_uring with the same
> >>>> effect as delayed_fput that wasn't being flushed.
> >>>
> >>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
> >>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
> >>
> >> Sorry - was just working through my recollections/initial thought
> >> process out loud
> > 
> > No worries, it might actually be a combination and this is why my
> > io_uring side patch didn't fully resolve it. Wrote a simple reproducer
> > and it seems to reliably trigger it, but is fixed with an flush of the
> > delayed fput list on mount -EBUSY return. Still digging...
> 
> I discussed this with Christian offline. I have a patch that is pretty
> simple, but it does mean that you'd wait for delayed fput flush off
> umount. Which seems kind of iffy.
> 
> I think we need to back up a bit and consider if the kill && umount
> really is sane. If you kill a task that has open files, then any fput
> from that task will end up being delayed. This means that the umount may
> very well fail.

That's why we have MNT_DETACH:

umount2("/mnt", MNT_DETACH)

will succeed even if fds are still open.
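
A minimal C sketch of the same thing (illustrative only; /mnt is just an
example path):

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* Detach the mount from the namespace immediately; the
                 * superblock is only torn down once the last reference
                 * (e.g. a still-open fd) goes away. */
                if (umount2("/mnt", MNT_DETACH) < 0) {
                        perror("umount2");
                        return 1;
                }
                return 0;
        }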

> 
> It'd be handy if we could have umount wait for that to finish, but I'm
> not at all confident this is a sane solution for all cases. And as
> discussed, we have no way to even identify which files we'd need to
> flush out of the delayed list.
> 
> Maybe the test case just needs fixing? Christian suggested lazy/detach
> umount and wait for sb release. There's an fsnotify hook for that,
> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
> to me that this would be the way to make it more reliable when killing
> of tasks with open files are involved.

You can wait on superblock destruction today in multiple ways. Roughly
from the shell this should work:

        root@wittgenstein:~# cat sb_wait.sh
        #! /bin/bash
        
        echo "WARNING WARNING: I SUCK AT SHELL SCRIPTS"
        
        echo "mount fs"
        sudo mount -t tmpfs tmpfs /mnt
        touch /mnt/bla
        
        echo "pin sb by random file for 5s"
        sleep 5 > /mnt/bla &
        
        echo "establish inotify watch for sb destruction"
        inotifywait -e unmount /mnt &
        pid=$!
        
        echo "regular umount - will fail..."
        umount /mnt
        
        findmnt | grep "/mnt"
        
        echo "lazily umount - will succeed"
        umount -l /mnt
        
        findmnt | grep "/mnt"
        
        echo "and now we wait"
        wait $pid
        
        echo "done"

You can also use a tiny bpf LSM, or fanotify in the future, as we have
plans for that.
Kent Overstreet June 28, 2023, 5:52 p.m. UTC | #26
On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
> I discussed this with Christian offline. I have a patch that is pretty
> simple, but it does mean that you'd wait for delayed fput flush off
> umount. Which seems kind of iffy.
> 
> I think we need to back up a bit and consider if the kill && umount
> really is sane. If you kill a task that has open files, then any fput
> from that task will end up being delayed. This means that the umount may
> very well fail.
> 
> It'd be handy if we could have umount wait for that to finish, but I'm
> not at all confident this is a sane solution for all cases. And as
> discussed, we have no way to even identify which files we'd need to
> flush out of the delayed list.
> 
> Maybe the test case just needs fixing? Christian suggested lazy/detach
> umount and wait for sb release. There's an fsnotify hook for that,
> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
> to me that this would be the way to make it more reliable when killing
> of tasks with open files are involved.

No, this is a real breakage. Any time we introduce unexpected
asynchrony there's the potential for breakage: case in point, there was
a filesystem that made rm asynchronous, and then there were scripts out
there that deleted files until df showed under some threshold... whoops.

This would break anyone that does fuser; umount, and making the umount
lazy just moves the race to the next thing that uses the block device.

I'd like to know how delayed_fput() avoids this.
Kent Overstreet June 28, 2023, 5:54 p.m. UTC | #27
On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote:
> On 6/27/23 10:01 PM, Kent Overstreet wrote:
> > On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
> >> On 6/27/23 2:15 PM, Kent Overstreet wrote:
> >>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
> >>>> failing because it assumed it was XFS.
> >>>>
> >>>> I suspected this was just a timing issue, and it looks like that's
> >>>> exactly what it is. Looking at the test case, it'll randomly kill -9
> >>>> fsstress, and if that happens while we have io_uring IO pending, then we
> >>>> process completions inline (for a PF_EXITING current). This means they
> >>>> get pushed to fallback work, which runs out of line. If we hit that case
> >>>> AND the timing is such that it hasn't been processed yet, we'll still be
> >>>> holding a file reference under the mount point and umount will -EBUSY
> >>>> fail.
> >>>>
> >>>> As far as I can tell, this can happen with aio as well, it's just harder
> >>>> to hit. If the fput happens while the task is exiting, then fput will
> >>>> end up being delayed through a workqueue as well. The test case assumes
> >>>> that once it's reaped the exit of the killed task that all files are
> >>>> released, which isn't necessarily true if they are done out-of-line.
> >>>
> >>> Yeah, I traced it through to the delayed fput code as well.
> >>>
> >>> I'm not sure delayed fput is responsible here; what I learned when I was
> >>> tracking this down has mostly fell out of my brain, so take anything I
> >>> say with a large grain of salt. But I believe I tested with delayed_fput
> >>> completely disabled, and found another thing in io_uring with the same
> >>> effect as delayed_fput that wasn't being flushed.
> >>
> >> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
> >> io_uring can end up doing. But yes, delayed_fput() is another candidate.
> > 
> > Sorry - was just working through my recollections/initial thought
> > process out loud
> 
> No worries, it might actually be a combination and this is why my
> io_uring side patch didn't fully resolve it. Wrote a simple reproducer
> and it seems to reliably trigger it, but is fixed with an flush of the
> delayed fput list on mount -EBUSY return. Still digging...
> 
> >>>> For io_uring specifically, it may make sense to wait on the fallback
> >>>> work. The below patch does this, and should fix the issue. But I'm not
> >>>> fully convinced that this is really needed, as I do think this can
> >>>> happen without io_uring as well. It just doesn't right now as the test
> >>>> does buffered IO, and aio will be fully sync with buffered IO. That
> >>>> means there's either no gap where aio will hit it without O_DIRECT, or
> >>>> it's just small enough that it hasn't been hit.
> >>>
> >>> I just tried your patch and I still have generic/388 failing - it
> >>> might've taken a bit longer to pop this time.
> >>
> >> Yep see the same here. Didn't have time to look into it after sending
> >> that email today, just took a quick stab at writing a reproducer and
> >> ended up crashing bcachefs:
> > 
> > You must have hit an error before we finished initializing the
> > filesystem, the list head never got initialized. Patch for that will be
> > in the testing branch momentarily.
> 
> I'll pull that in. In testing just now, I hit a few more leaks:
> 
> unreferenced object 0xffff0000e55cf200 (size 128):
>   comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cde48528>] __kmalloc+0xac/0xd4
>     [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20
>     [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc

Can you faddr2line this? I just did a bunch of kmemleak testing and
didn't see it.

> unreferenced object 0xffff0000e55cf480 (size 128):
>   comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cde48528>] __kmalloc+0xac/0xd4
>     [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60
>     [<000000008ff20762>] register_shrinker+0x14/0x34
>     [<000000003d050c32>] bch2_fs_btree_key_cache_init+0x88/0x90
>     [<00000000d9f351c0>] bch2_fs_alloc+0x7c0/0xbcc
>     [<000000003b8339fd>] bch2_fs_open+0x19c/0x430
>     [<00000000aef40a23>] bch2_mount+0x194/0x45c
>     [<0000000005e49357>] legacy_get_tree+0x2c/0x54
>     [<00000000f5813622>] vfs_get_tree+0x28/0xd4
>     [<00000000ea6972ec>] path_mount+0x5d0/0x6c8
>     [<00000000468ec307>] do_mount+0x80/0xa4
>     [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168
>     [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8
>     [<000000008f20c487>] do_el0_svc+0xbc/0xf0

This one is actually a bug in unregister_shrinker(); I have a patch I'll
have to send to Andrew.
Kent Overstreet June 28, 2023, 5:56 p.m. UTC | #28
On Wed, Jun 28, 2023 at 09:22:15AM -0600, Jens Axboe wrote:
> On 6/28/23 8:58 AM, Jens Axboe wrote:
> > I should have something later today, don't feel like I fully understand
> > all of it just yet.
> 
> Might indeed be delayed_fput, just the flush is a bit broken in that it
> races with the worker doing the flush. In any case, with testing that, I
> hit this before I got an umount failure on loop 6 of generic/388:
> 
> fsck from util-linux 2.38.1
> recovering from clean shutdown, journal seq 14642
> journal read done, replaying entries 14642-14642
> checking allocations
> starting journal replay, 0 keys
> checking need_discard and freespace btrees
> checking lrus
> checking backpointers to alloc keys
> checking backpointers to extents
> backpointer for missing extent
>   u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing

Known bug, but it's gotten difficult to reproduce; if generic/388 ends
up being a better reproducer for this, that'll be nice.
Jens Axboe June 28, 2023, 8:44 p.m. UTC | #29
On 6/28/23 11:52 AM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
>> I discussed this with Christian offline. I have a patch that is pretty
>> simple, but it does mean that you'd wait for delayed fput flush off
>> umount. Which seems kind of iffy.
>>
>> I think we need to back up a bit and consider if the kill && umount
>> really is sane. If you kill a task that has open files, then any fput
>> from that task will end up being delayed. This means that the umount may
>> very well fail.
>>
>> It'd be handy if we could have umount wait for that to finish, but I'm
>> not at all confident this is a sane solution for all cases. And as
>> discussed, we have no way to even identify which files we'd need to
>> flush out of the delayed list.
>>
>> Maybe the test case just needs fixing? Christian suggested lazy/detach
>> umount and wait for sb release. There's an fsnotify hook for that,
>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
>> to me that this would be the way to make it more reliable when killing
>> of tasks with open files are involved.
> 
> No, this is a real breakage. Any time we introduce unexpected
> asynchrony there's the potential for breakage: case in point, there was
> a filesystem that made rm asynchronous, then there were scripts out
> there that deleted until df showed under some threshold.. whoops...

This is nothing new - any fput done from an exiting task will end up
being deferred. The window may be a bit wider now or a bit different,
but it's the same window. If an application assumes it can kill && wait
on a task and be guaranteed that the files are released as soon as wait
returns, it is mistaken. That is NOT the case.

> this would break anyone that does fuser; umount; and making the umount
> lazy just moves the race to the next thing that uses the block device.
> 
> I'd like to know how delayed_fput() avoids this.

What is "this" here? The delayed fput list is processed async, so the
size of the window really comes down to timing. Either the generic/388
test gets fixed so that it monitors for sb release like Christian
described, or we paper around it with a sync and a sleep or something
like that. The former would obviously be a lot more reliable.
Jens Axboe June 28, 2023, 8:45 p.m. UTC | #30
On 6/28/23 11:56 AM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 09:22:15AM -0600, Jens Axboe wrote:
>> On 6/28/23 8:58 AM, Jens Axboe wrote:
>>> I should have something later today, don't feel like I fully understand
>>> all of it just yet.
>>
>> Might indeed be delayed_fput, just the flush is a bit broken in that it
>> races with the worker doing the flush. In any case, with testing that, I
>> hit this before I got an umount failure on loop 6 of generic/388:
>>
>> fsck from util-linux 2.38.1
>> recovering from clean shutdown, journal seq 14642
>> journal read done, replaying entries 14642-14642
>> checking allocations
>> starting journal replay, 0 keys
>> checking need_discard and freespace btrees
>> checking lrus
>> checking backpointers to alloc keys
>> checking backpointers to extents
>> backpointer for missing extent
>>   u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing
> 
> Known bug, but it's gotten difficult to reproduce - if generic/388 ends
> up being a better reproducer for this that'll be nice

Seems to reproduce in anywhere from 1..4 iterations for me.
Jens Axboe June 28, 2023, 8:54 p.m. UTC | #31
On 6/28/23 11:54 AM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote:
>> On 6/27/23 10:01 PM, Kent Overstreet wrote:
>>> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
>>>> On 6/27/23 2:15 PM, Kent Overstreet wrote:
>>>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>>>>>> failing because it assumed it was XFS.
>>>>>>
>>>>>> I suspected this was just a timing issue, and it looks like that's
>>>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
>>>>>> fsstress, and if that happens while we have io_uring IO pending, then we
>>>>>> process completions inline (for a PF_EXITING current). This means they
>>>>>> get pushed to fallback work, which runs out of line. If we hit that case
>>>>>> AND the timing is such that it hasn't been processed yet, we'll still be
>>>>>> holding a file reference under the mount point and umount will -EBUSY
>>>>>> fail.
>>>>>>
>>>>>> As far as I can tell, this can happen with aio as well, it's just harder
>>>>>> to hit. If the fput happens while the task is exiting, then fput will
>>>>>> end up being delayed through a workqueue as well. The test case assumes
>>>>>> that once it's reaped the exit of the killed task that all files are
>>>>>> released, which isn't necessarily true if they are done out-of-line.
>>>>>
>>>>> Yeah, I traced it through to the delayed fput code as well.
>>>>>
>>>>> I'm not sure delayed fput is responsible here; what I learned when I was
>>>>> tracking this down has mostly fell out of my brain, so take anything I
>>>>> say with a large grain of salt. But I believe I tested with delayed_fput
>>>>> completely disabled, and found another thing in io_uring with the same
>>>>> effect as delayed_fput that wasn't being flushed.
>>>>
>>>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
>>>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
>>>
>>> Sorry - was just working through my recollections/initial thought
>>> process out loud
>>
>> No worries, it might actually be a combination and this is why my
>> io_uring side patch didn't fully resolve it. Wrote a simple reproducer
>> and it seems to reliably trigger it, but is fixed with an flush of the
>> delayed fput list on mount -EBUSY return. Still digging...
>>
>>>>>> For io_uring specifically, it may make sense to wait on the fallback
>>>>>> work. The below patch does this, and should fix the issue. But I'm not
>>>>>> fully convinced that this is really needed, as I do think this can
>>>>>> happen without io_uring as well. It just doesn't right now as the test
>>>>>> does buffered IO, and aio will be fully sync with buffered IO. That
>>>>>> means there's either no gap where aio will hit it without O_DIRECT, or
>>>>>> it's just small enough that it hasn't been hit.
>>>>>
>>>>> I just tried your patch and I still have generic/388 failing - it
>>>>> might've taken a bit longer to pop this time.
>>>>
>>>> Yep see the same here. Didn't have time to look into it after sending
>>>> that email today, just took a quick stab at writing a reproducer and
>>>> ended up crashing bcachefs:
>>>
>>> You must have hit an error before we finished initializing the
>>> filesystem, the list head never got initialized. Patch for that will be
>>> in the testing branch momentarily.
>>
>> I'll pull that in. In testing just now, I hit a few more leaks:
>>
>> unreferenced object 0xffff0000e55cf200 (size 128):
>>   comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace:
>>     [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>>     [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
>>     [<00000000cde48528>] __kmalloc+0xac/0xd4
>>     [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20
>>     [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc
> 
> Can you faddr2line this? I just did a bunch of kmemleak testing and
> didn't see it.

0xffff800008589a20 is in bch2_fs_alloc (fs/bcachefs/super.c:813).
808		    !(c->online_reserved = alloc_percpu(u64)) ||
809		    !(c->btree_paths_bufs = alloc_percpu(struct btree_path_buf)) ||
810		    mempool_init_kvpmalloc_pool(&c->btree_bounce_pool, 1,
811						btree_bytes(c)) ||
812		    mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048) ||
813		    !(c->unused_inode_hints = kcalloc(1U << c->inode_shard_bits,
814						      sizeof(u64), GFP_KERNEL))) {
815			ret = -BCH_ERR_ENOMEM_fs_other_alloc;
816			goto err;
817		}
Jens Axboe June 28, 2023, 9:17 p.m. UTC | #32
On 6/28/23 2:44 PM, Jens Axboe wrote:
> On 6/28/23 11:52 AM, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
>>> I discussed this with Christian offline. I have a patch that is pretty
>>> simple, but it does mean that you'd wait for delayed fput flush off
>>> umount. Which seems kind of iffy.
>>>
>>> I think we need to back up a bit and consider if the kill && umount
>>> really is sane. If you kill a task that has open files, then any fput
>>> from that task will end up being delayed. This means that the umount may
>>> very well fail.
>>>
>>> It'd be handy if we could have umount wait for that to finish, but I'm
>>> not at all confident this is a sane solution for all cases. And as
>>> discussed, we have no way to even identify which files we'd need to
>>> flush out of the delayed list.
>>>
>>> Maybe the test case just needs fixing? Christian suggested lazy/detach
>>> umount and wait for sb release. There's an fsnotify hook for that,
>>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
>>> to me that this would be the way to make it more reliable when killing
>>> of tasks with open files are involved.
>>
>> No, this is a real breakage. Any time we introduce unexpected
>> asynchrony there's the potential for breakage: case in point, there was
>> a filesystem that made rm asynchronous, then there were scripts out
>> there that deleted until df showed under some threshold.. whoops...
> 
> This is nothing new - any fput done from an exiting task will end up
> being deferred. The window may be a bit wider now or a bit different,
> but it's the same window. If an application assumes it can kill && wait
> on a task and be guaranteed that the files are released as soon as wait
> returns, it is mistaken. That is NOT the case.

Case in point: I just changed my reproducer to use aio instead of
io_uring. Here's the full script:

#!/bin/bash

DEV=/dev/nvme1n1
MNT=/data
ITER=0

while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		killall -9 fio > /dev/null 2>&1
		echo will wait
		wait > /dev/null 2>&1
		echo done waiting
		ps -e | grep "fio " > /dev/null 2>&1
	done
	sudo umount $MNT
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done

and if I run that, fails on the first umount attempt in that loop:

axboe@m1max-kvm ~> bash test2.sh
loop 0
will wait
done waiting
umount: /data: target is busy.

So yeah, this is _nothing_ new. I really don't think trying to address
this in the kernel is the right approach; it'd be a lot saner to harden
the xfstests side to deal with the umount a bit more gracefully. There
are obviously tons of other ways that a mount could get pinned, which
isn't too relevant here since the bdev and mount point are basically
exclusive to the test being run. But the kill and delayed fput are
enough to make that case, IMHO.
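
If the test side does get hardened, the dumbest form is a bounded retry;
a purely illustrative sketch (not something I'm proposing for xfstests
as-is):

        #include <errno.h>
        #include <unistd.h>
        #include <sys/mount.h>

        /* Retry a busy umount a few times, giving any deferred fputs a
         * chance to run, rather than failing on the first -EBUSY. */
        static int umount_retry(const char *target, int tries)
        {
                while (tries--) {
                        if (umount(target) == 0)
                                return 0;
                        if (errno != EBUSY)
                                break;
                        sync();         /* not strictly needed, belt and braces */
                        usleep(100000); /* 100ms between attempts */
                }
                return -1;
        }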
Kent Overstreet June 28, 2023, 10:13 p.m. UTC | #33
On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
> Case in point, just changed my reproducer to use aio instead of
> io_uring. Here's the full script:
> 
> #!/bin/bash
> 
> DEV=/dev/nvme1n1
> MNT=/data
> ITER=0
> 
> while true; do
> 	echo loop $ITER
> 	sudo mount $DEV $MNT
> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
> 	Y=$(($RANDOM % 3))
> 	X=$(($RANDOM % 10))
> 	VAL="$Y.$X"
> 	sleep $VAL
> 	ps -e | grep fio > /dev/null 2>&1
> 	while [ $? -eq 0 ]; do
> 		killall -9 fio > /dev/null 2>&1
> 		echo will wait
> 		wait > /dev/null 2>&1
> 		echo done waiting
> 		ps -e | grep "fio " > /dev/null 2>&1
> 	done
> 	sudo umount /data
> 	if [ $? -ne 0 ]; then
> 		break
> 	fi
> 	((ITER++))
> done
> 
> and if I run that, fails on the first umount attempt in that loop:
> 
> axboe@m1max-kvm ~> bash test2.sh
> loop 0
> will wait
> done waiting
> umount: /data: target is busy.
> 
> So yeah, this is _nothing_ new. I really don't think trying to address
> this in the kernel is the right approach, it'd be a lot saner to harden
> the xfstest side to deal with the umount a bit more sanely. There are
> obviously tons of other ways that a mount could get pinned, which isn't
> too relevant here since the bdev and mount point are basically exclusive
> to the test being run. But the kill and delayed fput is enough to make
> that case imho.

Uh, count me very much not in favor of hacking around bugs elsewhere.

Al, do you know if this has been considered before? We've got fput()
being called from aio completion, which often runs out of a workqueue
(if not a workqueue, then a bottom half of some sort - what happens
then, I wonder?) - so the effect is that it goes on the global list, not
the task work list.
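
(Paraphrasing fs/file_table.c from memory, the dispatch in fput() looks
roughly like this; details may be slightly off:)

        void fput(struct file *file)
        {
                if (atomic_long_dec_and_test(&file->f_count)) {
                        struct task_struct *task = current;

                        /* Normal process context: queue to task_work,
                         * which runs before the task returns to
                         * userspace. */
                        if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
                                init_task_work(&file->f_rcuhead, ____fput);
                                if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
                                        return;
                                /* task_work_add() fails once the task is
                                 * past exit_task_work(); fall through. */
                        }
                        /* IRQ/kthread/exiting context: global list,
                         * processed by a workqueue some time later. */
                        if (llist_add(&file->f_llist, &delayed_fput_list))
                                schedule_delayed_work(&delayed_fput_work, 1);
                }
        }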

Hence, kill -9ing a process doing aio (or io_uring IO, for extra
reasons) causes umount to fail with -EBUSY.

And since there's no mechanism for userspace to deal with this besides
sleep and retry, this seems pretty gross.

I'd be willing to tackle this for aio since I know that code...
Jens Axboe June 28, 2023, 10:14 p.m. UTC | #34
On 6/28/23 2:54 PM, Jens Axboe wrote:
> On 6/28/23 11:54 AM, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote:
>>> On 6/27/23 10:01 PM, Kent Overstreet wrote:
>>>> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
>>>>> On 6/27/23 2:15 PM, Kent Overstreet wrote:
>>>>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>>>>>>> failing because it assumed it was XFS.
>>>>>>>
>>>>>>> I suspected this was just a timing issue, and it looks like that's
>>>>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
>>>>>>> fsstress, and if that happens while we have io_uring IO pending, then we
>>>>>>> process completions inline (for a PF_EXITING current). This means they
>>>>>>> get pushed to fallback work, which runs out of line. If we hit that case
>>>>>>> AND the timing is such that it hasn't been processed yet, we'll still be
>>>>>>> holding a file reference under the mount point and umount will -EBUSY
>>>>>>> fail.
>>>>>>>
>>>>>>> As far as I can tell, this can happen with aio as well, it's just harder
>>>>>>> to hit. If the fput happens while the task is exiting, then fput will
>>>>>>> end up being delayed through a workqueue as well. The test case assumes
>>>>>>> that once it's reaped the exit of the killed task that all files are
>>>>>>> released, which isn't necessarily true if they are done out-of-line.
>>>>>>
>>>>>> Yeah, I traced it through to the delayed fput code as well.
>>>>>>
>>>>>> I'm not sure delayed fput is responsible here; what I learned when I was
>>>>>> tracking this down has mostly fell out of my brain, so take anything I
>>>>>> say with a large grain of salt. But I believe I tested with delayed_fput
>>>>>> completely disabled, and found another thing in io_uring with the same
>>>>>> effect as delayed_fput that wasn't being flushed.
>>>>>
>>>>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
>>>>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
>>>>
>>>> Sorry - was just working through my recollections/initial thought
>>>> process out loud
>>>
>>> No worries, it might actually be a combination and this is why my
>>> io_uring side patch didn't fully resolve it. Wrote a simple reproducer
>>> and it seems to reliably trigger it, but is fixed with an flush of the
>>> delayed fput list on mount -EBUSY return. Still digging...
>>>
>>>>>>> For io_uring specifically, it may make sense to wait on the fallback
>>>>>>> work. The below patch does this, and should fix the issue. But I'm not
>>>>>>> fully convinced that this is really needed, as I do think this can
>>>>>>> happen without io_uring as well. It just doesn't right now as the test
>>>>>>> does buffered IO, and aio will be fully sync with buffered IO. That
>>>>>>> means there's either no gap where aio will hit it without O_DIRECT, or
>>>>>>> it's just small enough that it hasn't been hit.
>>>>>>
>>>>>> I just tried your patch and I still have generic/388 failing - it
>>>>>> might've taken a bit longer to pop this time.
>>>>>
>>>>> Yep see the same here. Didn't have time to look into it after sending
>>>>> that email today, just took a quick stab at writing a reproducer and
>>>>> ended up crashing bcachefs:
>>>>
>>>> You must have hit an error before we finished initializing the
>>>> filesystem, the list head never got initialized. Patch for that will be
>>>> in the testing branch momentarily.
>>>
>>> I'll pull that in. In testing just now, I hit a few more leaks:
>>>
>>> unreferenced object 0xffff0000e55cf200 (size 128):
>>>   comm "mount", pid 723, jiffies 4294899134 (age 85.868s)
>>>   hex dump (first 32 bytes):
>>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>>   backtrace:
>>>     [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>>>     [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178
>>>     [<00000000cde48528>] __kmalloc+0xac/0xd4
>>>     [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20
>>>     [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc
>>
>> Can you faddr2line this? I just did a bunch of kmemleak testing and
>> didn't see it.
> 
> 0xffff800008589a20 is in bch2_fs_alloc (fs/bcachefs/super.c:813).
> 808		    !(c->online_reserved = alloc_percpu(u64)) ||
> 809		    !(c->btree_paths_bufs = alloc_percpu(struct btree_path_buf)) ||
> 810		    mempool_init_kvpmalloc_pool(&c->btree_bounce_pool, 1,
> 811						btree_bytes(c)) ||
> 812		    mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048) ||
> 813		    !(c->unused_inode_hints = kcalloc(1U << c->inode_shard_bits,
> 814						      sizeof(u64), GFP_KERNEL))) {
> 815			ret = -BCH_ERR_ENOMEM_fs_other_alloc;
> 816			goto err;
> 817		}

I got a whole bunch more running that aio reproducer I sent earlier. I'm
sure a lot of these are dupes; sending them here for completeness.

[  677.739815] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
[ 1283.963249] kmemleak: 37 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

unreferenced object 0xffff0000e35de000 (size 8192):
  comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    1d 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
    [<0000000078a13296>] krealloc+0x7c/0xc4
    [<00000000f1fea4ad>] bch2_sb_realloc+0x12c/0x150
    [<00000000f03d5ce6>] __copy_super+0x104/0x17c
    [<000000005567521f>] bch2_sb_to_fs+0x3c/0x80
    [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff00020a209900 (size 128):
  comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s)
  hex dump (first 32 bytes):
    03 01 01 00 02 01 01 00 04 01 01 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<00000000fcf82258>] kmalloc_array.constprop.0+0x18/0x20
    [<00000000182c3be4>] __bch2_sb_replicas_v0_to_cpu_replicas+0x50/0x118
    [<0000000012583a94>] bch2_sb_replicas_to_cpu_replicas+0xb0/0xc0
    [<00000000fcd0b373>] bch2_sb_to_fs+0x4c/0x80
    [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000206785400 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.068s)
  hex dump (first 32 bytes):
    00 00 d9 20 02 00 ff ff 01 00 00 00 01 04 00 00  ... ............
    01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<00000000bb95f8a0>] bch2_fs_alloc+0x690/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
unreferenced object 0xffff000206785700 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
  hex dump (first 32 bytes):
    00 00 96 2d 02 00 ff ff 01 00 00 00 01 04 00 00  ...-............
    01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<0000000089ab54c3>] bch2_fs_replicas_init+0x64/0xac
    [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000206785600 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
  hex dump (first 32 bytes):
    00 1a 05 00 00 00 00 00 00 0c 02 00 00 00 00 00  ................
    42 9c ba 00 00 00 00 00 00 00 00 00 00 00 00 00  B...............
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<00000000f949dcc7>] replicas_table_update+0x84/0x214
    [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac
    [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
unreferenced object 0xffff000206785580 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 01 00 00 00 01 04 00 00  ................
    01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<00000000639b7f33>] replicas_table_update+0x98/0x214
    [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac
    [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
unreferenced object 0xffff000206785080 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<000000001335974a>] __prealloc_shrinker+0x3c/0x60
    [<0000000017b0bc26>] register_shrinker+0x14/0x34
    [<00000000c07d01d7>] bch2_fs_btree_cache_init+0xf8/0x150
    [<000000004b948640>] bch2_fs_alloc+0x7ac/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000200f2ec00 (size 1024):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
  hex dump (first 32 bytes):
    40 00 00 00 00 00 00 00 a8 66 18 09 00 00 00 00  @........f......
    10 ec f2 00 02 00 ff ff 10 ec f2 00 02 00 ff ff  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000066405974>] kvmalloc_node+0x54/0xe4
    [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120
    [<0000000000df2e94>] rhashtable_init+0x148/0x1ac
    [<0000000080f397f7>] bch2_fs_btree_key_cache_init+0x48/0x90
    [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000206785b80 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<000000001335974a>] __prealloc_shrinker+0x3c/0x60
    [<0000000017b0bc26>] register_shrinker+0x14/0x34
    [<00000000228dd43a>] bch2_fs_btree_key_cache_init+0x88/0x90
    [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000206785500 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
  hex dump (first 32 bytes):
    00 00 20 2b 02 00 ff ff 01 00 00 00 01 04 00 00  .. +............
    01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<00000000fc134979>] bch2_fs_btree_iter_init+0x98/0x130
    [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000206785480 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
  hex dump (first 32 bytes):
    00 00 97 05 02 00 ff ff 01 00 00 00 01 04 00 00  ................
    01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000004d03e2b7>] bch2_fs_btree_iter_init+0xb8/0x130
    [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000230a31a00 (size 512):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000009502ae7b>] kmalloc_trace+0x38/0x78
    [<0000000060cbc45a>] init_srcu_struct_fields+0x38/0x284
    [<00000000643a7c95>] init_srcu_struct+0x10/0x18
    [<00000000c46c2041>] bch2_fs_btree_iter_init+0xc8/0x130
    [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000222f14f00 (size 256):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
  hex dump (first 32 bytes):
    03 00 00 00 01 00 ff ff 1a cf e0 f2 17 b2 a8 24  ...............$
    cf 4a ba c3 fb 05 19 cd f6 4d f5 45 e7 e8 29 eb  .J.......M.E..).
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000066405974>] kvmalloc_node+0x54/0xe4
    [<00000000c83b22ef>] bch2_fs_buckets_waiting_for_journal_init+0x44/0x6c
    [<0000000026230712>] bch2_fs_alloc+0x7f0/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
unreferenced object 0xffff000230e00000 (size 720896):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
  hex dump (first 32 bytes):
    40 71 26 38 00 00 00 00 00 00 00 00 00 00 00 00  @q&8............
    d2 17 f9 2f 75 7e 51 2a 01 01 00 00 14 00 00 00  .../u~Q*........
  backtrace:
    [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164
    [<000000009024f86b>] __kmalloc_node+0x34/0xd4
    [<0000000066405974>] kvmalloc_node+0x54/0xe4
    [<00000000729eb36b>] bch2_fs_btree_write_buffer_init+0x58/0xb4
    [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
    [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134
unreferenced object 0xffff000230900000 (size 720896):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
  hex dump (first 32 bytes):
    88 96 28 f7 00 00 00 00 00 00 00 00 00 00 00 00  ..(.............
    d2 17 f9 2f 75 7e 51 2a 01 01 00 00 13 00 00 00  .../u~Q*........
  backtrace:
    [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164
    [<000000009024f86b>] __kmalloc_node+0x34/0xd4
    [<0000000066405974>] kvmalloc_node+0x54/0xe4
    [<00000000f27707f5>] bch2_fs_btree_write_buffer_init+0x7c/0xb4
    [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
    [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134
unreferenced object 0xffff0000c8d1e300 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
  hex dump (first 32 bytes):
    00 c0 0a 02 02 00 ff ff 00 80 5a 20 00 80 ff ff  ..........Z ....
    00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00  .P..............
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff0002020ac000 (size 448):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c8 02 00 02 02 00 ff ff  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff0000c8d1e500 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
  hex dump (first 32 bytes):
    00 40 e4 c9 00 00 ff ff c0 d9 9a 03 00 fc ff ff  .@..............
    80 d9 9a 03 00 fc ff ff 40 d9 9a 03 00 fc ff ff  ........@.......
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
    [<00000000a2b87494>] bioset_init+0x208/0x22c
    [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
unreferenced object 0xffff0000c8d1e900 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
  hex dump (first 32 bytes):
    00 c2 0a 02 02 00 ff ff 00 00 58 20 00 80 ff ff  ..........X ....
    00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00  .P..............
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff0002020ac200 (size 448):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 0d 69 37 bf  .............i7.
    00 00 00 00 00 00 00 00 c8 c2 0a 02 02 00 ff ff  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff0000c8d1e980 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
  hex dump (first 32 bytes):
    00 50 e4 c9 00 00 ff ff c0 dc 9a 03 00 fc ff ff  .P..............
    80 dc 9a 03 00 fc ff ff 40 44 23 03 00 fc ff ff  ........@D#.....
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
    [<00000000a2b87494>] bioset_init+0x208/0x22c
    [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
unreferenced object 0xffff000230a31e00 (size 512):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
  hex dump (first 32 bytes):
    00 9f 39 08 00 fc ff ff 40 f9 99 08 00 fc ff ff  ..9.....@.......
    40 c0 8b 08 00 fc ff ff 00 d8 14 08 00 fc ff ff  @...............
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000009f58f780>] bch2_fs_io_init+0x9c/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
unreferenced object 0xffff000200f2e800 (size 1024):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
  hex dump (first 32 bytes):
    40 00 00 00 00 00 00 00 89 16 1e cd 00 00 00 00  @...............
    10 e8 f2 00 02 00 ff ff 10 e8 f2 00 02 00 ff ff  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000066405974>] kvmalloc_node+0x54/0xe4
    [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120
    [<0000000000df2e94>] rhashtable_init+0x148/0x1ac
    [<00000000347789c6>] bch2_fs_io_init+0xb8/0x124
    [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000222f14700 (size 256):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
  hex dump (first 32 bytes):
    68 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  h...............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<000000003a6af69a>] crypto_alloc_tfmmem+0x3c/0x70
    [<000000006c0841c0>] crypto_create_tfm_node+0x20/0xa0
    [<00000000b0aa6a0f>] crypto_alloc_tfm_node+0x94/0xac
    [<00000000a2421d04>] crypto_alloc_shash+0x20/0x28
    [<00000000aeafee8e>] bch2_fs_encryption_init+0x64/0x150
    [<0000000002e060b3>] bch2_fs_alloc+0x840/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
unreferenced object 0xffff00020e544100 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
  hex dump (first 32 bytes):
    00 00 f0 20 02 00 ff ff 80 04 f0 20 02 00 ff ff  ... ....... ....
    00 09 f0 20 02 00 ff ff 80 0d f0 20 02 00 ff ff  ... ....... ....
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000220f00000 (size 1104):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 98 e7 87 30 02 00 ff ff  ...........0....
    22 01 00 00 00 00 ad de 18 00 f0 20 02 00 ff ff  ".......... ....
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000220f00480 (size 1104):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
  hex dump (first 32 bytes):
    80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    01 00 00 00 00 00 00 00 00 00 00 00 9d a2 98 2c  ...............,
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000220f00900 (size 1104):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
  hex dump (first 32 bytes):
    22 01 00 00 00 00 ad de 08 09 f0 20 02 00 ff ff  ".......... ....
    08 09 f0 20 02 00 ff ff b9 17 f0 20 02 00 ff ff  ... ....... ....
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff000220f00d80 (size 1104):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 34 43 3f b1  ............4C?.
    00 00 00 00 00 00 00 00 24 00 bb 04 02 28 1e 3b  ........$....(.;
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000001719fe70>] bioset_init+0x188/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
unreferenced object 0xffff00020e544000 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
  hex dump (first 32 bytes):
    00 00 b5 05 02 00 ff ff 00 50 b5 05 02 00 ff ff  .........P......
    00 10 b5 05 02 00 ff ff 00 b0 18 09 02 00 ff ff  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000007b360995>] __kmalloc_node+0xac/0xd4
    [<0000000050ae8904>] mempool_init_node+0x64/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
    [<00000000a2b87494>] bioset_init+0x208/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
unreferenced object 0xffff00020918b000 (size 4096):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
  hex dump (first 32 bytes):
    c0 5b 26 03 00 fc ff ff 00 10 00 00 00 00 00 00  .[&.............
    22 01 00 00 00 00 ad de 18 b0 18 09 02 00 ff ff  "...............
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
    [<000000002d6118f3>] mempool_init_node+0x94/0xd8
    [<00000000e714c59a>] mempool_init+0x14/0x1c
    [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
    [<00000000a2b87494>] bioset_init+0x208/0x22c
    [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
    [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
    [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
unreferenced object 0xffff000200f2f400 (size 1024):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
  hex dump (first 32 bytes):
    19 00 00 00 00 00 00 00 9d 19 00 00 00 00 00 00  ................
    a6 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<0000000097d0e280>] bch2_blacklist_table_initialize+0x48/0xc4
    [<000000007af2f7c0>] bch2_fs_recovery+0x220/0x140c
    [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac
    [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
    [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
    [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
    [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
    [<00000000b4ee996a>] el0_svc+0x74/0x9c
unreferenced object 0xffff00020e544800 (size 128):
  comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
  hex dump (first 32 bytes):
    07 00 00 01 09 00 00 00 73 74 61 72 74 69 6e 67  ........starting
    20 6a 6f 75 72 6e 61 6c 20 61 74 20 65 6e 74 72   journal at entr
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
    [<0000000078a13296>] krealloc+0x7c/0xc4
    [<00000000224f82f4>] __darray_make_room.constprop.0+0x5c/0x7c
    [<00000000caa2f6f2>] __bch2_trans_log_msg+0x80/0x12c
    [<0000000034a8dfea>] __bch2_fs_log_msg+0x68/0x158
    [<00000000cc0719ad>] bch2_journal_log_msg+0x60/0x98
    [<00000000a0b3d87b>] bch2_fs_recovery+0x8f0/0x140c
    [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac
    [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430
    [<00000000e72d508e>] bch2_mount+0x194/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
unreferenced object 0xffff00020f2d8398 (size 184):
  comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s)
  hex dump (first 32 bytes):
    00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0
    [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
    [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
    [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
    [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
    [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
    [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
    [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
    [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
    [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
    [<00000000527b4561>] path_mount+0x5d0/0x6c8
    [<00000000dc643d96>] do_mount+0x80/0xa4
unreferenced object 0xffff000222f15e00 (size 256):
  comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s)
  hex dump (first 32 bytes):
    12 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 ff ff ff ff 00 10 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338
    [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0
    [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
    [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
    [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
    [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
    [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
    [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
    [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
    [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
    [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c
    [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
    [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
unreferenced object 0xffff00020ab0da80 (size 128):
  comm "fio", pid 3068, jiffies 4294924399 (age 3938.112s)
  hex dump (first 32 bytes):
    70 61 74 68 3a 20 69 64 78 20 20 30 20 72 65 66  path: idx  0 ref
    20 30 3a 30 20 50 20 53 20 62 74 72 65 65 3d 73   0:0 P S btree=s
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
    [<0000000078a13296>] krealloc+0x7c/0xc4
    [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c
    [<00000000e73dab89>] bch2_prt_printf+0xac/0x104
    [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8
    [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64
    [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18
    [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104
    [<00000000715f184d>] btree_path_alloc+0x44/0x140
    [<0000000028aac82e>] bch2_path_get+0x190/0x210
    [<000000001fbd1416>] bch2_trans_iter_init_outlined+0xd4/0x100
    [<00000000b7c2c8e8>] bch2_trans_iter_init.constprop.0+0x28/0x30
    [<000000005ee45b0d>] __bch2_dirent_lookup_trans+0xc4/0x20c
    [<00000000bf9849b2>] bch2_dirent_lookup+0x9c/0x10c
unreferenced object 0xffff00020f2d8450 (size 184):
  comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s)
  hex dump (first 32 bytes):
    00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
    [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0
    [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
    [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
    [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
    [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
    [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
    [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
    [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
    [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
    [<00000000225a6085>] bch2_lookup+0x7c/0xb8
    [<0000000059304a98>] __lookup_slow+0xd4/0x114
    [<000000001225c82d>] walk_component+0x98/0xd4
    [<0000000095114e46>] path_lookupat+0x84/0x114
    [<000000002ee74fa2>] filename_lookup+0x54/0xc4
unreferenced object 0xffff000222f15a00 (size 256):
  comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s)
  hex dump (first 32 bytes):
    13 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 ff ff ff ff f3 15 00 30 00 00 00 00  ...........0....
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<00000000cd9515c0>] __kmalloc+0xac/0xd4
    [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338
    [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0
    [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
    [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
    [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
    [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
    [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
    [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
    [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
    [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
    [<00000000225a6085>] bch2_lookup+0x7c/0xb8
    [<0000000059304a98>] __lookup_slow+0xd4/0x114
    [<000000001225c82d>] walk_component+0x98/0xd4
unreferenced object 0xffff0002058e0e00 (size 256):
  comm "fio", pid 3081, jiffies 4294924461 (age 3937.868s)
  hex dump (first 32 bytes):
    70 61 74 68 3a 20 69 64 78 20 20 31 20 72 65 66  path: idx  1 ref
    20 30 3a 30 20 50 20 20 20 62 74 72 65 65 3d 65   0:0 P   btree=e
  backtrace:
    [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
    [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
    [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
    [<0000000078a13296>] krealloc+0x7c/0xc4
    [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c
    [<00000000e73dab89>] bch2_prt_printf+0xac/0x104
    [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8
    [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64
    [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18
    [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104
    [<00000000af279ad9>] __bch2_btree_path_make_mut+0x64/0x1d0
    [<00000000b6ea382b>] __bch2_btree_path_set_pos+0x5c/0x1f4
    [<000000001f2292b9>] bch2_btree_path_set_pos+0x68/0x78
    [<0000000046402275>] bch2_btree_iter_peek_slot+0xd0/0x3b0
    [<00000000a578d851>] bchfs_read.isra.0+0x128/0x77c
    [<00000000d544d588>] bch2_readahead+0x1a0/0x264
Jens Axboe June 28, 2023, 10:33 p.m. UTC | #35
On 6/28/23 4:13 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
>> Case in point, just changed my reproducer to use aio instead of
>> io_uring. Here's the full script:
>>
>> #!/bin/bash
>>
>> DEV=/dev/nvme1n1
>> MNT=/data
>> ITER=0
>>
>> while true; do
>> 	echo loop $ITER
>> 	sudo mount $DEV $MNT
>> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
>> 	Y=$(($RANDOM % 3))
>> 	X=$(($RANDOM % 10))
>> 	VAL="$Y.$X"
>> 	sleep $VAL
>> 	ps -e | grep fio > /dev/null 2>&1
>> 	while [ $? -eq 0 ]; do
>> 		killall -9 fio > /dev/null 2>&1
>> 		echo will wait
>> 		wait > /dev/null 2>&1
>> 		echo done waiting
>> 		ps -e | grep "fio " > /dev/null 2>&1
>> 	done
>> 	sudo umount /data
>> 	if [ $? -ne 0 ]; then
>> 		break
>> 	fi
>> 	((ITER++))
>> done
>>
>> and if I run that, it fails on the first umount attempt in that loop:
>>
>> axboe@m1max-kvm ~> bash test2.sh
>> loop 0
>> will wait
>> done waiting
>> umount: /data: target is busy.
>>
>> So yeah, this is _nothing_ new. I really don't think trying to address
>> this in the kernel is the right approach; it'd be a lot saner to harden
>> the xfstests side to deal with the umount a bit more gracefully. There
>> are obviously tons of other ways that a mount could get pinned, which
>> isn't too relevant here since the bdev and mount point are basically
>> exclusive to the test being run. But the kill and delayed fput are
>> enough to make that case imho.
> 
> Uh, count me very much not in favor of hacking around bugs elsewhere.
> 
> Al, do you know if this has been considered before? We've got fput()
> being called from aio completion, which often runs out of a workqueue (if
> not a workqueue, a bottom half of some sort - what happens then, I
> wonder) - so the effect is that it goes on the global list, not the task
> work list.
> 
> hence, kill -9ing a process doing aio (or io_uring io, for extra
> reasons) causes umount to fail with -EBUSY.
> 
> and since there's no mechanism for userspace to deal with this besides
> sleep and retry, this seems pretty gross.

But there is, as Christian outlined. I would not call it pretty or
intuitive, but you can in fact make it work just fine, and not just for
the deferred fput() case but also in the presence of other kinds of
pins, of which there are of course many.

> I'd be willing to tackle this for aio since I know that code...

But it's not aio (or io_uring or whatever), it's simply the fact that
doing an fput() from an exiting task (for example) will end up being
done async. And hence waiting for task exits is NOT enough to ensure
that all file references have been released.

Since there are a variety of other reasons why a mount may be pinned and
fail to umount, perhaps it's worth considering that changing this
behavior won't buy us that much. Especially since it's been around for
more than 10 years:

commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sun Jun 24 09:56:45 2012 +0400

    switch fput to task_work_add

though that commit message goes on to read:

    We are guaranteed that __fput() will be done before we return
    to userland (or exit).  Note that for fput() from a kernel
    thread we get an async behaviour; it's almost always OK, but
    sometimes you might need to have __fput() completed before
    you do anything else.  There are two mechanisms for that -
    a general barrier (flush_delayed_fput()) and explicit
    __fput_sync().  Both should be used with care (as was the
    case for fput() from kernel threads all along).  See comments
    in fs/file_table.c for details.

where that first sentence isn't true if the task is indeed exiting. I
guess you can say that it is, as it doesn't return to userland, but
that's splitting hairs. The commit in question doesn't seem to handle
that case, though I'm assuming that came in with a later fixup.

It is true if the task_work gets added, as that will get run before
returning to userspace.

If a case were to be made that we also guarantee that fput has been done
by the time the task returns to userspace, or exits, then we'd probably
want to move that deferred fput list to the task_struct and ensure that
it gets run if the task exits rather than have a global deferred list.
Currently we have:

1) If kthread or in interrupt
	1a) add to global fput list
2) task_work_add if not. If that fails, goto 1a.

which would then become:

1) If kthread or in interrupt
	1a) add to global fput list
2) task_work_add if not. If that fails, we know the task is exiting; add
   to a per-task defer list to be run at a convenient time before the
   task has exited.

and seems a lot saner than hacking around this in umount specifically.
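
As a rough sketch (paraphrasing the current fput() in fs/file_table.c
from memory, so don't hold me to the details; the per-task list in step
2a and its deferred_fput_list field are the hypothetical parts):

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/task_work.h>

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			/* 2) run __fput() before returning to userspace */
			init_task_work(&file->f_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
				return;
			/*
			 * 2a) task_work_add() failed, so the task is
			 * exiting: instead of falling through to the
			 * global list, queue on a (new, hypothetical)
			 * per-task list drained from do_exit(), so that
			 * "task has exited" implies its fputs have run.
			 */
			llist_add(&file->f_llist, &task->deferred_fput_list);
			return;
		}

		/* 1a) kthread or interrupt context: global list, run async */
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}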
Kent Overstreet June 28, 2023, 10:55 p.m. UTC | #36
On Wed, Jun 28, 2023 at 04:33:55PM -0600, Jens Axboe wrote:
> On 6/28/23 4:13 PM, Kent Overstreet wrote:
> > On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
> >> Case in point, just changed my reproducer to use aio instead of
> >> io_uring. Here's the full script:
> >>
> >> #!/bin/bash
> >>
> >> DEV=/dev/nvme1n1
> >> MNT=/data
> >> ITER=0
> >>
> >> while true; do
> >> 	echo loop $ITER
> >> 	sudo mount $DEV $MNT
> >> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
> >> 	Y=$(($RANDOM % 3))
> >> 	X=$(($RANDOM % 10))
> >> 	VAL="$Y.$X"
> >> 	sleep $VAL
> >> 	ps -e | grep fio > /dev/null 2>&1
> >> 	while [ $? -eq 0 ]; do
> >> 		killall -9 fio > /dev/null 2>&1
> >> 		echo will wait
> >> 		wait > /dev/null 2>&1
> >> 		echo done waiting
> >> 		ps -e | grep "fio " > /dev/null 2>&1
> >> 	done
> >> 	sudo umount /data
> >> 	if [ $? -ne 0 ]; then
> >> 		break
> >> 	fi
> >> 	((ITER++))
> >> done
> >>
> >> and if I run that, it fails on the first umount attempt in that loop:
> >>
> >> axboe@m1max-kvm ~> bash test2.sh
> >> loop 0
> >> will wait
> >> done waiting
> >> umount: /data: target is busy.
> >>
> >> So yeah, this is _nothing_ new. I really don't think trying to address
> >> this in the kernel is the right approach; it'd be a lot saner to harden
> >> the xfstests side to deal with the umount a bit more gracefully. There
> >> are obviously tons of other ways that a mount could get pinned, which
> >> isn't too relevant here since the bdev and mount point are basically
> >> exclusive to the test being run. But the kill and delayed fput are
> >> enough to make that case imho.
> > 
> > Uh, count me very much not in favor of hacking around bugs elsewhere.
> > 
> > Al, do you know if this has been considered before? We've got fput()
> > being called from aio completion, which often runs out of a workqueue (if
> > not a workqueue, a bottom half of some sort - what happens then, I
> > wonder) - so the effect is that it goes on the global list, not the task
> > work list.
> > 
> > hence, kill -9ing a process doing aio (or io_uring io, for extra
> > reasons) causes umount to fail with -EBUSY.
> > 
> > and since there's no mechanism for userspace to deal with this besides
> > sleep and retry, this seems pretty gross.
> 
> But there is, as Christian outlined. I would not call it pretty or
> intuitive, but you can in fact make it work just fine, and not just for
> the deferred fput() case but also in the presence of other kinds of
> pins, of which there are of course many.

No, because as I explained, that just defers the race until you next try
to use the device, since with lazy umount the device will still be in
use when umount returns.

What you'd want is a lazy, synchronous umount, and AFAIK that doesn't
exist.

> > I'd be willing to tackle this for aio since I know that code...
> 
> But it's not aio (or io_uring or whatever), it's simply the fact that
> doing an fput() from an exiting task (for example) will end up being
> done async. And hence waiting for task exits is NOT enough to ensure
> that all file references have been released.
> 
> Since there are a variety of other reasons why a mount may be pinned and
> fail to umount, perhaps it's worth considering that changing this
> behavior won't buy us that much. Especially since it's been around for
> more than 10 years:

Because it seems that before io_uring the race was quite a bit harder to
hit - I only started seeing it when things started switching over to
io_uring. generic/388 used to pass reliably for me (pre backpointers);
now it doesn't.

> commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Sun Jun 24 09:56:45 2012 +0400
> 
>     switch fput to task_work_add
> 
> though that commit message goes on to read:
> 
>     We are guaranteed that __fput() will be done before we return
>     to userland (or exit).  Note that for fput() from a kernel
>     thread we get an async behaviour; it's almost always OK, but
>     sometimes you might need to have __fput() completed before
>     you do anything else.  There are two mechanisms for that -
>     a general barrier (flush_delayed_fput()) and explicit
>     __fput_sync().  Both should be used with care (as was the
>     case for fput() from kernel threads all along).  See comments
>     in fs/file_table.c for details.
> 
> where that first sentence isn't true if the task is indeed exiting. I
> guess you can say that it is, as it doesn't return to userland, but
> that's splitting hairs. The commit in question doesn't seem to handle
> that case, though I'm assuming that came in with a later fixup.
> 
> It is true if the task_work gets added, as that will get run before
> returning to userspace.

Yes, AIO seems to very much be the exceptional case that wasn't
originally considered.

> If a case were to be made that we also guarantee that fput has been done
> by the time the task returns to userspace, or exits,

And that does seem to be the intent of the original code, no?

> then we'd probably want to move that deferred fput list to the
> task_struct and ensure that it gets run if the task exits rather than
> have a global deferred list. Currently we have:
> 
> 1) If kthread or in interrupt
> 	1a) add to global fput list
> 2) task_work_add if not. If that fails, goto 1a.
> 
> which would then become:
> 
> 1) If kthread or in interrupt
> 	1a) add to global fput list
> 2) task_work_add if not. If that fails, we know the task is exiting; add
>    to a per-task defer list to be run at a convenient time before the
>    task has exited.

No, it becomes:
 if we're running in a user task, or if we're doing an operation on
 behalf of a user task, add to the user task's deferred list; otherwise
 add to the global deferred list.
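
Something like this, as a sketch (same caveats as before: the "owner"
plumbing - i.e. recording the submitting user task and passing it down
to the completion-side fput() - and the deferred_fput_list field are
entirely hypothetical):

/* fput() variant given the user task the I/O was done on behalf of */
static void fput_on_behalf_of(struct file *file, struct task_struct *owner)
{
	if (!atomic_long_dec_and_test(&file->f_count))
		return;

	if (owner && !(owner->flags & PF_KTHREAD)) {
		/* user task, or I/O done on its behalf: its deferred list */
		llist_add(&file->f_llist, &owner->deferred_fput_list);
	} else {
		/* no user task to attribute this to: global deferred list */
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}

aio completion would then call fput_on_behalf_of() with whatever task
was recorded at submission time.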
Kent Overstreet June 28, 2023, 11:04 p.m. UTC | #37
On Wed, Jun 28, 2023 at 04:14:44PM -0600, Jens Axboe wrote:
> Got a whole bunch more running that aio reproducer I sent earlier. I'm
> sure a lot of these are dupes; sending them here for completeness.

Are you running 'echo scan > /sys/kernel/debug/kmemleak' while the test
is running? I see a lot of spurious leaks when I do that; they go away
if I scan after everything's shut down.

> 
> [  677.739815] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> [ 1283.963249] kmemleak: 37 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> 
> unreferenced object 0xffff0000e35de000 (size 8192):
>   comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     1d 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
>     [<0000000078a13296>] krealloc+0x7c/0xc4
>     [<00000000f1fea4ad>] bch2_sb_realloc+0x12c/0x150
>     [<00000000f03d5ce6>] __copy_super+0x104/0x17c
>     [<000000005567521f>] bch2_sb_to_fs+0x3c/0x80
>     [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff00020a209900 (size 128):
>   comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s)
>   hex dump (first 32 bytes):
>     03 01 01 00 02 01 01 00 04 01 01 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<00000000fcf82258>] kmalloc_array.constprop.0+0x18/0x20
>     [<00000000182c3be4>] __bch2_sb_replicas_v0_to_cpu_replicas+0x50/0x118
>     [<0000000012583a94>] bch2_sb_replicas_to_cpu_replicas+0xb0/0xc0
>     [<00000000fcd0b373>] bch2_sb_to_fs+0x4c/0x80
>     [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000206785400 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.068s)
>   hex dump (first 32 bytes):
>     00 00 d9 20 02 00 ff ff 01 00 00 00 01 04 00 00  ... ............
>     01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<00000000bb95f8a0>] bch2_fs_alloc+0x690/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
> unreferenced object 0xffff000206785700 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
>   hex dump (first 32 bytes):
>     00 00 96 2d 02 00 ff ff 01 00 00 00 01 04 00 00  ...-............
>     01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<0000000089ab54c3>] bch2_fs_replicas_init+0x64/0xac
>     [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000206785600 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
>   hex dump (first 32 bytes):
>     00 1a 05 00 00 00 00 00 00 0c 02 00 00 00 00 00  ................
>     42 9c ba 00 00 00 00 00 00 00 00 00 00 00 00 00  B...............
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<00000000f949dcc7>] replicas_table_update+0x84/0x214
>     [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac
>     [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
> unreferenced object 0xffff000206785580 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 01 00 00 00 01 04 00 00  ................
>     01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<00000000639b7f33>] replicas_table_update+0x98/0x214
>     [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac
>     [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
> unreferenced object 0xffff000206785080 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<000000001335974a>] __prealloc_shrinker+0x3c/0x60
>     [<0000000017b0bc26>] register_shrinker+0x14/0x34
>     [<00000000c07d01d7>] bch2_fs_btree_cache_init+0xf8/0x150
>     [<000000004b948640>] bch2_fs_alloc+0x7ac/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000200f2ec00 (size 1024):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
>   hex dump (first 32 bytes):
>     40 00 00 00 00 00 00 00 a8 66 18 09 00 00 00 00  @........f......
>     10 ec f2 00 02 00 ff ff 10 ec f2 00 02 00 ff ff  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000066405974>] kvmalloc_node+0x54/0xe4
>     [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120
>     [<0000000000df2e94>] rhashtable_init+0x148/0x1ac
>     [<0000000080f397f7>] bch2_fs_btree_key_cache_init+0x48/0x90
>     [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000206785b80 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<000000001335974a>] __prealloc_shrinker+0x3c/0x60
>     [<0000000017b0bc26>] register_shrinker+0x14/0x34
>     [<00000000228dd43a>] bch2_fs_btree_key_cache_init+0x88/0x90
>     [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000206785500 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
>   hex dump (first 32 bytes):
>     00 00 20 2b 02 00 ff ff 01 00 00 00 01 04 00 00  .. +............
>     01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<00000000fc134979>] bch2_fs_btree_iter_init+0x98/0x130
>     [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000206785480 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
>   hex dump (first 32 bytes):
>     00 00 97 05 02 00 ff ff 01 00 00 00 01 04 00 00  ................
>     01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000004d03e2b7>] bch2_fs_btree_iter_init+0xb8/0x130
>     [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000230a31a00 (size 512):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000009502ae7b>] kmalloc_trace+0x38/0x78
>     [<0000000060cbc45a>] init_srcu_struct_fields+0x38/0x284
>     [<00000000643a7c95>] init_srcu_struct+0x10/0x18
>     [<00000000c46c2041>] bch2_fs_btree_iter_init+0xc8/0x130
>     [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000222f14f00 (size 256):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
>   hex dump (first 32 bytes):
>     03 00 00 00 01 00 ff ff 1a cf e0 f2 17 b2 a8 24  ...............$
>     cf 4a ba c3 fb 05 19 cd f6 4d f5 45 e7 e8 29 eb  .J.......M.E..).
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000066405974>] kvmalloc_node+0x54/0xe4
>     [<00000000c83b22ef>] bch2_fs_buckets_waiting_for_journal_init+0x44/0x6c
>     [<0000000026230712>] bch2_fs_alloc+0x7f0/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
> unreferenced object 0xffff000230e00000 (size 720896):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
>   hex dump (first 32 bytes):
>     40 71 26 38 00 00 00 00 00 00 00 00 00 00 00 00  @q&8............
>     d2 17 f9 2f 75 7e 51 2a 01 01 00 00 14 00 00 00  .../u~Q*........
>   backtrace:
>     [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164
>     [<000000009024f86b>] __kmalloc_node+0x34/0xd4
>     [<0000000066405974>] kvmalloc_node+0x54/0xe4
>     [<00000000729eb36b>] bch2_fs_btree_write_buffer_init+0x58/0xb4
>     [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
>     [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134
> unreferenced object 0xffff000230900000 (size 720896):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s)
>   hex dump (first 32 bytes):
>     88 96 28 f7 00 00 00 00 00 00 00 00 00 00 00 00  ..(.............
>     d2 17 f9 2f 75 7e 51 2a 01 01 00 00 13 00 00 00  .../u~Q*........
>   backtrace:
>     [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164
>     [<000000009024f86b>] __kmalloc_node+0x34/0xd4
>     [<0000000066405974>] kvmalloc_node+0x54/0xe4
>     [<00000000f27707f5>] bch2_fs_btree_write_buffer_init+0x7c/0xb4
>     [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
>     [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134
> unreferenced object 0xffff0000c8d1e300 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
>   hex dump (first 32 bytes):
>     00 c0 0a 02 02 00 ff ff 00 80 5a 20 00 80 ff ff  ..........Z ....
>     00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00  .P..............
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff0002020ac000 (size 448):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 c8 02 00 02 02 00 ff ff  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff0000c8d1e500 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s)
>   hex dump (first 32 bytes):
>     00 40 e4 c9 00 00 ff ff c0 d9 9a 03 00 fc ff ff  .@..............
>     80 d9 9a 03 00 fc ff ff 40 d9 9a 03 00 fc ff ff  ........@.......
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
>     [<00000000a2b87494>] bioset_init+0x208/0x22c
>     [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
> unreferenced object 0xffff0000c8d1e900 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
>   hex dump (first 32 bytes):
>     00 c2 0a 02 02 00 ff ff 00 00 58 20 00 80 ff ff  ..........X ....
>     00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00  .P..............
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff0002020ac200 (size 448):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 0d 69 37 bf  .............i7.
>     00 00 00 00 00 00 00 00 c8 c2 0a 02 02 00 ff ff  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff0000c8d1e980 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s)
>   hex dump (first 32 bytes):
>     00 50 e4 c9 00 00 ff ff c0 dc 9a 03 00 fc ff ff  .P..............
>     80 dc 9a 03 00 fc ff ff 40 44 23 03 00 fc ff ff  ........@D#.....
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
>     [<00000000a2b87494>] bioset_init+0x208/0x22c
>     [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
> unreferenced object 0xffff000230a31e00 (size 512):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
>   hex dump (first 32 bytes):
>     00 9f 39 08 00 fc ff ff 40 f9 99 08 00 fc ff ff  ..9.....@.......
>     40 c0 8b 08 00 fc ff ff 00 d8 14 08 00 fc ff ff  @...............
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000009f58f780>] bch2_fs_io_init+0x9c/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
> unreferenced object 0xffff000200f2e800 (size 1024):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
>   hex dump (first 32 bytes):
>     40 00 00 00 00 00 00 00 89 16 1e cd 00 00 00 00  @...............
>     10 e8 f2 00 02 00 ff ff 10 e8 f2 00 02 00 ff ff  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000066405974>] kvmalloc_node+0x54/0xe4
>     [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120
>     [<0000000000df2e94>] rhashtable_init+0x148/0x1ac
>     [<00000000347789c6>] bch2_fs_io_init+0xb8/0x124
>     [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000222f14700 (size 256):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s)
>   hex dump (first 32 bytes):
>     68 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  h...............
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<000000003a6af69a>] crypto_alloc_tfmmem+0x3c/0x70
>     [<000000006c0841c0>] crypto_create_tfm_node+0x20/0xa0
>     [<00000000b0aa6a0f>] crypto_alloc_tfm_node+0x94/0xac
>     [<00000000a2421d04>] crypto_alloc_shash+0x20/0x28
>     [<00000000aeafee8e>] bch2_fs_encryption_init+0x64/0x150
>     [<0000000002e060b3>] bch2_fs_alloc+0x840/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
> unreferenced object 0xffff00020e544100 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
>   hex dump (first 32 bytes):
>     00 00 f0 20 02 00 ff ff 80 04 f0 20 02 00 ff ff  ... ....... ....
>     00 09 f0 20 02 00 ff ff 80 0d f0 20 02 00 ff ff  ... ....... ....
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000220f00000 (size 1104):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 98 e7 87 30 02 00 ff ff  ...........0....
>     22 01 00 00 00 00 ad de 18 00 f0 20 02 00 ff ff  ".......... ....
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000220f00480 (size 1104):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s)
>   hex dump (first 32 bytes):
>     80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     01 00 00 00 00 00 00 00 00 00 00 00 9d a2 98 2c  ...............,
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000220f00900 (size 1104):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
>   hex dump (first 32 bytes):
>     22 01 00 00 00 00 ad de 08 09 f0 20 02 00 ff ff  ".......... ....
>     08 09 f0 20 02 00 ff ff b9 17 f0 20 02 00 ff ff  ... ....... ....
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff000220f00d80 (size 1104):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
>   hex dump (first 32 bytes):
>     01 00 00 00 00 00 00 00 00 00 00 00 34 43 3f b1  ............4C?.
>     00 00 00 00 00 00 00 00 24 00 bb 04 02 28 1e 3b  ........$....(.;
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000001719fe70>] bioset_init+0x188/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
> unreferenced object 0xffff00020e544000 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s)
>   hex dump (first 32 bytes):
>     00 00 b5 05 02 00 ff ff 00 50 b5 05 02 00 ff ff  .........P......
>     00 10 b5 05 02 00 ff ff 00 b0 18 09 02 00 ff ff  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000007b360995>] __kmalloc_node+0xac/0xd4
>     [<0000000050ae8904>] mempool_init_node+0x64/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
>     [<00000000a2b87494>] bioset_init+0x208/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
> unreferenced object 0xffff00020918b000 (size 4096):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
>   hex dump (first 32 bytes):
>     c0 5b 26 03 00 fc ff ff 00 10 00 00 00 00 00 00  .[&.............
>     22 01 00 00 00 00 ad de 18 b0 18 09 02 00 ff ff  "...............
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c
>     [<000000002d6118f3>] mempool_init_node+0x94/0xd8
>     [<00000000e714c59a>] mempool_init+0x14/0x1c
>     [<000000002f5588b4>] biovec_init_pool+0x24/0x2c
>     [<00000000a2b87494>] bioset_init+0x208/0x22c
>     [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130
>     [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc
>     [<00000000223e06bf>] bch2_fs_open+0x19c/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
> unreferenced object 0xffff000200f2f400 (size 1024):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
>   hex dump (first 32 bytes):
>     19 00 00 00 00 00 00 00 9d 19 00 00 00 00 00 00  ................
>     a6 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<0000000097d0e280>] bch2_blacklist_table_initialize+0x48/0xc4
>     [<000000007af2f7c0>] bch2_fs_recovery+0x220/0x140c
>     [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac
>     [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
>     [<00000000f493e836>] __arm64_sys_mount+0x150/0x168
>     [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8
>     [<00000000e707b03d>] do_el0_svc+0xbc/0xf0
>     [<00000000b4ee996a>] el0_svc+0x74/0x9c
> unreferenced object 0xffff00020e544800 (size 128):
>   comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s)
>   hex dump (first 32 bytes):
>     07 00 00 01 09 00 00 00 73 74 61 72 74 69 6e 67  ........starting
>     20 6a 6f 75 72 6e 61 6c 20 61 74 20 65 6e 74 72   journal at entr
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
>     [<0000000078a13296>] krealloc+0x7c/0xc4
>     [<00000000224f82f4>] __darray_make_room.constprop.0+0x5c/0x7c
>     [<00000000caa2f6f2>] __bch2_trans_log_msg+0x80/0x12c
>     [<0000000034a8dfea>] __bch2_fs_log_msg+0x68/0x158
>     [<00000000cc0719ad>] bch2_journal_log_msg+0x60/0x98
>     [<00000000a0b3d87b>] bch2_fs_recovery+0x8f0/0x140c
>     [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac
>     [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430
>     [<00000000e72d508e>] bch2_mount+0x194/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
> unreferenced object 0xffff00020f2d8398 (size 184):
>   comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s)
>   hex dump (first 32 bytes):
>     00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0
>     [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
>     [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
>     [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
>     [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
>     [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
>     [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
>     [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
>     [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
>     [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
>     [<00000000527b4561>] path_mount+0x5d0/0x6c8
>     [<00000000dc643d96>] do_mount+0x80/0xa4
> unreferenced object 0xffff000222f15e00 (size 256):
>   comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s)
>   hex dump (first 32 bytes):
>     12 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 ff ff ff ff 00 10 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338
>     [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0
>     [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
>     [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
>     [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
>     [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
>     [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
>     [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
>     [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
>     [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
>     [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c
>     [<00000000b040daa5>] legacy_get_tree+0x2c/0x54
>     [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4
> unreferenced object 0xffff00020ab0da80 (size 128):
>   comm "fio", pid 3068, jiffies 4294924399 (age 3938.112s)
>   hex dump (first 32 bytes):
>     70 61 74 68 3a 20 69 64 78 20 20 30 20 72 65 66  path: idx  0 ref
>     20 30 3a 30 20 50 20 53 20 62 74 72 65 65 3d 73   0:0 P S btree=s
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
>     [<0000000078a13296>] krealloc+0x7c/0xc4
>     [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c
>     [<00000000e73dab89>] bch2_prt_printf+0xac/0x104
>     [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8
>     [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64
>     [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18
>     [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104
>     [<00000000715f184d>] btree_path_alloc+0x44/0x140
>     [<0000000028aac82e>] bch2_path_get+0x190/0x210
>     [<000000001fbd1416>] bch2_trans_iter_init_outlined+0xd4/0x100
>     [<00000000b7c2c8e8>] bch2_trans_iter_init.constprop.0+0x28/0x30
>     [<000000005ee45b0d>] __bch2_dirent_lookup_trans+0xc4/0x20c
>     [<00000000bf9849b2>] bch2_dirent_lookup+0x9c/0x10c
> unreferenced object 0xffff00020f2d8450 (size 184):
>   comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s)
>   hex dump (first 32 bytes):
>     00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c
>     [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0
>     [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
>     [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
>     [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
>     [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
>     [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
>     [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
>     [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
>     [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
>     [<00000000225a6085>] bch2_lookup+0x7c/0xb8
>     [<0000000059304a98>] __lookup_slow+0xd4/0x114
>     [<000000001225c82d>] walk_component+0x98/0xd4
>     [<0000000095114e46>] path_lookupat+0x84/0x114
>     [<000000002ee74fa2>] filename_lookup+0x54/0xc4
> unreferenced object 0xffff000222f15a00 (size 256):
>   comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s)
>   hex dump (first 32 bytes):
>     13 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 ff ff ff ff f3 15 00 30 00 00 00 00  ...........0....
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<00000000cd9515c0>] __kmalloc+0xac/0xd4
>     [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338
>     [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0
>     [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184
>     [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0
>     [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30
>     [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0
>     [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74
>     [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc
>     [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74
>     [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0
>     [<00000000225a6085>] bch2_lookup+0x7c/0xb8
>     [<0000000059304a98>] __lookup_slow+0xd4/0x114
>     [<000000001225c82d>] walk_component+0x98/0xd4
> unreferenced object 0xffff0002058e0e00 (size 256):
>   comm "fio", pid 3081, jiffies 4294924461 (age 3937.868s)
>   hex dump (first 32 bytes):
>     70 61 74 68 3a 20 69 64 78 20 20 31 20 72 65 66  path: idx  1 ref
>     20 30 3a 30 20 50 20 20 20 62 74 72 65 65 3d 65   0:0 P   btree=e
>   backtrace:
>     [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc
>     [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178
>     [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0
>     [<0000000078a13296>] krealloc+0x7c/0xc4
>     [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c
>     [<00000000e73dab89>] bch2_prt_printf+0xac/0x104
>     [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8
>     [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64
>     [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18
>     [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104
>     [<00000000af279ad9>] __bch2_btree_path_make_mut+0x64/0x1d0
>     [<00000000b6ea382b>] __bch2_btree_path_set_pos+0x5c/0x1f4
>     [<000000001f2292b9>] bch2_btree_path_set_pos+0x68/0x78
>     [<0000000046402275>] bch2_btree_iter_peek_slot+0xd0/0x3b0
>     [<00000000a578d851>] bchfs_read.isra.0+0x128/0x77c
>     [<00000000d544d588>] bch2_readahead+0x1a0/0x264
> 
> -- 
> Jens Axboe
>
Jens Axboe June 28, 2023, 11:11 p.m. UTC | #38
On 6/28/23 5:04 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 04:14:44PM -0600, Jens Axboe wrote:
>> Got a whole bunch more running that aio reproducer I sent earlier. I'm
>> sure a lot of these are dupes, sending them here for completeness.
> 
> Are you running 'echo scan > /sys/kernel/debug/kmemleak' while the test
> is running? I see a lot of spurious leaks when I do that that go away if
> I scan after everything's shut down.

Nope, and they remain in there. The cat dump I took was an hour later.
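
(For the record, the two kmemleak interfaces do different things, if I
remember the docs right:

  echo scan > /sys/kernel/debug/kmemleak    # force an immediate scan
  cat /sys/kernel/debug/kmemleak            # dump suspects from prior scans

and kmemleak also rescans on its own every 10 minutes by default, so by
the time I took that dump these objects had survived several scans.)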
Jens Axboe June 28, 2023, 11:14 p.m. UTC | #39
On 6/28/23 4:55 PM, Kent Overstreet wrote:
>> But it's not aio (or io_uring or whatever), it's simply the fact that
>> doing an fput() from an exiting task (for example) will end up being
>> done async. And hence waiting for task exits is NOT enough to ensure
>> that all file references have been released.
>>
>> Since there are a variety of other reasons why a mount may be pinned and
>> fail to umount, perhaps it's worth considering that changing this
>> behavior won't buy us that much. Especially since it's been around for
>> more than 10 years:
> 
> Because it seems that before io_uring the race was quite a bit harder to
> hit - I only started seeing it when things started switching over to
> io_uring. generic/388 used to pass reliably for me (pre backpointers),
> now it doesn't.

I literally just pasted a script that hits it in one second with aio. So
maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
hit with aio. As demonstrated. The io_uring side is not hard to bring
into parity on that front; here's a patch I posted earlier today for 6.5:

https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/

Doesn't change the fact that you can easily hit this with io_uring or
aio, and probably more things too (didn't look any further). Is it a
realistic thing outside of funky tests? Probably not really, or at least
if those guys hit it they'd probably have the work-around hack in place
in their script already.

But the fact is that it's been around for a decade. It's somehow a lot
easier to hit with bcachefs than XFS, which may just be because the
former has a bunch of workers and this may be deferring the delayed fput
work more. Just hand waving.

>> then we'd probably want to move that deferred fput list to the
>> task_struct and ensure that it gets run if the task exits rather than
>> have a global deferred list. Currently we have:
>>
>>
>> 1) If kthread or in interrupt
>> 	1a) add to global fput list
>> 2) task_work_add if not. If that fails, goto 1a.
>>
>> which would then become:
>>
>> 1) If kthread or in interrupt
>> 	1a) add to global fput list
>> 2) task_work_add if not. If that fails, we know the task is exiting. Add
>>    to the per-task defer list to be run at a convenient time before the
>>    task has exited.
> 
> no, it becomes:
>  if we're running in a user task, or if we're doing an operation on
>  behalf of a user task, add to the user task's deferred list; otherwise
>  add to global deferred list.

And how would the "on behalf of a user task" work in terms of being
in_interrupt()?
Kent Overstreet June 28, 2023, 11:50 p.m. UTC | #40
On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
> On 6/28/23 4:55 PM, Kent Overstreet wrote:
> >> But it's not aio (or io_uring or whatever), it's simply the fact that
> >> doing an fput() from an exiting task (for example) will end up being
> >> done async. And hence waiting for task exits is NOT enough to ensure
> >> that all file references have been released.
> >>
> >> Since there are a variety of other reasons why a mount may be pinned and
> >> fail to umount, perhaps it's worth considering that changing this
> >> behavior won't buy us that much. Especially since it's been around for
> >> more than 10 years:
> > 
> > Because it seems that before io_uring the race was quite a bit harder to
> > hit - I only started seeing it when things started switching over to
> > io_uring. generic/388 used to pass reliably for me (pre backpointers),
> > now it doesn't.
> 
> I literally just pasted a script that hits it in one second with aio. So
> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
> hit with aio. As demonstrated. The io_uring side is not hard to bring
> into parity on that front; here's a patch I posted earlier today for 6.5:
> 
> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/
> 
> Doesn't change the fact that you can easily hit this with io_uring or
> aio, and probably more things too (didn't look any further). Is it a
> realistic thing outside of funky tests? Probably not really, or at least
> if those guys hit it they'd probably have the work-around hack in place
> in their script already.
> 
> But the fact is that it's been around for a decade. It's somehow a lot
> easier to hit with bcachefs than XFS, which may just be because the
> former has a bunch of workers and this may be deferring the delayed fput
> work more. Just hand waving.

Not sure what you're arguing here...?

We've had a long-standing bug, and it's recently become much easier to
hit (for multiple reasons); we seem to be in agreement on all that. All
I'm saying is that the bug having existed for so long is no reason not
to fix it now.

> >> then we'd probably want to move that deferred fput list to the
> >> task_struct and ensure that it gets run if the task exits rather than
> >> have a global deferred list. Currently we have:
> >>
> >>
> >> 1) If kthread or in interrupt
> >> 	1a) add to global fput list
> >> 2) task_work_add if not. If that fails, goto 1a.
> >>
> >> which would then become:
> >>
> >> 1) If kthread or in interrupt
> >> 	1a) add to global fput list
> >> 2) task_work_add if not. If that fails, we know the task is exiting. Add
> >>    to the per-task defer list to be run at a convenient time before the
> >>    task has exited.
> > 
> > no, it becomes:
> >  if we're running in a user task, or if we're doing an operation on
> >  behalf of a user task, add to the user task's deferred list; otherwise
> >  add to global deferred list.
> 
> And how would the "on behalf of a user task" work in terms of being
> in_interrupt()?

I don't see any relation to in_interrupt?

We'd have to add a version of fput() that takes an additional
task_struct argument, and plumb that through the aio code - kioctx
lifetime is tied to mm_struct, not task_struct, so we'd have to add a
ref to the task_struct to kiocb.

Which would probably be a good thing tbh, it'd let us e.g. account cpu
time back to the original task when kiocb completion has to run out of a
workqueue.
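
Rough sketch of the shape I have in mind - fput_for_task() and the
per-task list are made-up names, nothing like this exists yet:

/*
 * Hypothetical: drop a file ref, deferring the final __fput() to an
 * explicit task rather than whatever context we happen to complete in.
 */
void fput_for_task(struct file *file, struct task_struct *task)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		init_task_work(&file->f_rcuhead, ____fput);
		if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
			return;
		/*
		 * Task already ran exit_task_work(): punt to a
		 * (hypothetical) per-task defer list that exit would flush
		 * before the task is fully gone, so umount can wait on it
		 * instead of racing the global delayed list.
		 */
		llist_add(&file->f_llist, &task->deferred_fput_list);
	}
}

aio completion would then call that with the task_struct ref stashed in
the kiocb.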
Dave Chinner June 29, 2023, 1 a.m. UTC | #41
On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
> > On 6/28/23 4:55 PM, Kent Overstreet wrote:
> > >> But it's not aio (or io_uring or whatever), it's simply the fact that
> > >> doing an fput() from an exiting task (for example) will end up being
> > >> done async. And hence waiting for task exits is NOT enough to ensure
> > >> that all file references have been released.
> > >>
> > >> Since there are a variety of other reasons why a mount may be pinned and
> > >> fail to umount, perhaps it's worth considering that changing this
> > >> behavior won't buy us that much. Especially since it's been around for
> > >> more than 10 years:
> > > 
> > > Because it seems that before io_uring the race was quite a bit harder to
> > > hit - I only started seeing it when things started switching over to
> > > io_uring. generic/388 used to pass reliably for me (pre backpointers),
> > > now it doesn't.
> > 
> > I literally just pasted a script that hits it in one second with aio. So
> > maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
> > hit with aio. As demonstrated. The io_uring side is not hard to bring
> > into parity on that front; here's a patch I posted earlier today for 6.5:
> > 
> > https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/
> > 
> > Doesn't change the fact that you can easily hit this with io_uring or
> > aio, and probably more things too (didn't look any further). Is it a
> > realistic thing outside of funky tests? Probably not really, or at least
> > if those guys hit it they'd probably have the work-around hack in place
> > in their script already.
> > 
> > But the fact is that it's been around for a decade. It's somehow a lot
> > easier to hit with bcachefs than XFS, which may just be because the
> > former has a bunch of workers and this may be deferring the delayed fput
> > work more. Just hand waving.
> 
> Not sure what you're arguing here...?
> 
> We've had a long-standing bug, and it's recently become much easier to
> hit (for multiple reasons); we seem to be in agreement on all that. All
> I'm saying is that the bug having existed for so long is no reason not
> to fix it now.

I agree with Kent here - the kernel bug needs to be fixed
regardless of how long it has been around. Blaming the messenger
(userspace, fstests, etc.) and saying it should work around a
spurious, unpredictable, undesirable and user-undebuggable kernel
behaviour is not an acceptable solution here...

I don't care how the kernel bug gets fixed, I just want the spurious
unmount failures when there are no userspace processes actively
using the filesystem to go away forever.

-Dave.
Jens Axboe June 29, 2023, 1:29 a.m. UTC | #42
On 6/28/23 5:50 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
>> On 6/28/23 4:55 PM, Kent Overstreet wrote:
>>>> But it's not aio (or io_uring or whatever), it's simply the fact that
>>>> doing an fput() from an exiting task (for example) will end up being
>>>> done async. And hence waiting for task exits is NOT enough to ensure
>>>> that all file references have been released.
>>>>
>>>> Since there are a variety of other reasons why a mount may be pinned and
>>>> fail to umount, perhaps it's worth considering that changing this
>>>> behavior won't buy us that much. Especially since it's been around for
>>>> more than 10 years:
>>>
>>> Because it seems that before io_uring the race was quite a bit harder to
>>> hit - I only started seeing it when things started switching over to
>>> io_uring. generic/388 used to pass reliably for me (pre backpointers),
>>> now it doesn't.
>>
>> I literally just pasted a script that hits it in one second with aio. So
>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
>> hit with aio. As demonstrated. The io_uring side is not hard to bring
>> into parity on that front; here's a patch I posted earlier today for 6.5:
>>
>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/
>>
>> Doesn't change the fact that you can easily hit this with io_uring or
>> aio, and probably more things too (didn't look any further). Is it a
>> realistic thing outside of funky tests? Probably not really, or at least
>> if those guys hit it they'd probably have the work-around hack in place
>> in their script already.
>>
>> But the fact is that it's been around for a decade. It's somehow a lot
>> easier to hit with bcachefs than XFS, which may just be because the
>> former has a bunch of workers and this may be deferring the delayed fput
>> work more. Just hand waving.
> 
> Not sure what you're arguing here...?
> 
> We've had a long-standing bug, and it's recently become much easier to
> hit (for multiple reasons); we seem to be in agreement on all that. All
> I'm saying is that the bug having existed for so long is no reason not
> to fix it now.

Not really arguing, just stating that it's not a huge problem, as it's
not something real-world usage would tend to do - which is probably why
we saw it in a test case instead.

>>>> then we'd probably want to move that deferred fput list to the
>>>> task_struct and ensure that it gets run if the task exits rather than
>>>> have a global deferred list. Currently we have:
>>>>
>>>>
>>>> 1) If kthread or in interrupt
>>>> 	1a) add to global fput list
>>>> 2) task_work_add if not. If that fails, goto 1a.
>>>>
>>>> which would then become:
>>>>
>>>> 1) If kthread or in interrupt
>>>> 	1a) add to global fput list
>>>> 2) task_work_add if not. If that fails, we know the task is exiting. Add
>>>>    to the per-task defer list to be run at a convenient time before the
>>>>    task has exited.
>>>
>>> no, it becomes:
>>>  if we're running in a user task, or if we're doing an operation on
>>>  behalf of a user task, add to the user task's deferred list; otherwise
>>>  add to global deferred list.
>>
>> And how would the "on behalf of a user task" work in terms of being
>> in_interrupt()?
> 
> I don't see any relation to in_interrupt?

Just saying that you'd now need the task passed in.

> We'd have to add a version of fput() that takes an additional
> task_struct argument, and plumb that through the aio code - kioctx
> lifetime is tied to mm_struct, not task_struct, so we'd have to add a
> ref to the task_struct to kiocb.
> 
> Which would probably be a good thing tbh, it'd let us e.g. account cpu
> time back to the original task when kiocb completion has to run out of a
> workqueue.

Might also introduce some funky dependencies. Probably not an issue if
it's tied to the aio_kiocb. If you go ahead with that, just make sure you
keep the task referencing out of the fput variant for users that don't
need that.
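
Something like this on the aio side, so only the async path pays for it
(ki_submit_task is an invented field name, and fput_for_task() is the
hypothetical variant from upthread):

	/* io_submit_one(): pin the submitting task for the kiocb */
	req->ki_submit_task = get_task_struct(current);

	/* completion, possibly running from a workqueue */
	fput_for_task(req->ki_filp, req->ki_submit_task);
	put_task_struct(req->ki_submit_task);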
Jens Axboe June 29, 2023, 1:33 a.m. UTC | #43
On 6/28/23 7:00 PM, Dave Chinner wrote:
> On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
>>> On 6/28/23 4:55 PM, Kent Overstreet wrote:
>>>>> But it's not aio (or io_uring or whatever), it's simply the fact that
>>>>> doing an fput() from an exiting task (for example) will end up being
>>>>> done async. And hence waiting for task exits is NOT enough to ensure
>>>>> that all file references have been released.
>>>>>
>>>>> Since there are a variety of other reasons why a mount may be pinned and
>>>>> fail to umount, perhaps it's worth considering that changing this
>>>>> behavior won't buy us that much. Especially since it's been around for
>>>>> more than 10 years:
>>>>
>>>> Because it seems that before io_uring the race was quite a bit harder to
>>>> hit - I only started seeing it when things started switching over to
>>>> io_uring. generic/388 used to pass reliably for me (pre backpointers),
>>>> now it doesn't.
>>>
>>> I literally just pasted a script that hits it in one second with aio. So
>>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
>>> hit with aio. As demonstrated. The io_uring side is not hard to bring into
>>> parity on that front; here's one I posted earlier today for 6.5:
>>>
>>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/
>>>
>>> Doesn't change the fact that you can easily hit this with io_uring or
>>> aio, and probably more things too (didn't look any further). Is it a
>>> realistic thing outside of funky tests? Probably not really, or at least
>>> if those guys hit it they'd probably have the work-around hack in place
>>> in their script already.
>>>
>>> But the fact is that it's been around for a decade. It's somehow a lot
>>> easier to hit with bcachefs than XFS, which may just be because the
>>> former has a bunch of workers and this may be deferring the delayed fput
>>> work more. Just hand waving.
>>
>> Not sure what you're arguing here...?
>>
>> We've had a long standing bug, it's recently become much easier to hit
>> (for multiple reasons); we seem to be in agreement on all that. All I'm
>> saying is that the existence of that bug previously is not a reason to
>> not fix it now.
> 
> I agree with Kent here - the kernel bug needs to be fixed
> regardless of how long it has been around. Blaming the messenger
> (userspace, fstests, etc) and saying it should work around a
> spurious, unpredictable, undesirable and user-undebuggable kernel
> behaviour is not an acceptable solution here...

Not sure why you both are putting words in my mouth, I've merely been
arguing pros and cons and the impact of this. I even linked the io_uring
addition for ensuring that side will work better once the deferred fput
is sorted out. I didn't like the idea of fixing this through umount, and
even outlined how it could be fixed properly by ensuring we flush
per-task deferred puts on task exit.

Do I think it's a big issue? Not at all, because a) nobody has reported
it until now, and b) it's kind of a stupid case. If we can fix it with
minimal impact, should we? Yep. Particularly as the assumptions stated
in the original commit I referenced were not even valid back then.
Christian Brauner June 29, 2023, 11:18 a.m. UTC | #44
On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote:
> On 6/28/23 7:00 PM, Dave Chinner wrote:
> > [...]
> > 
> > I agree with Kent here - the kernel bug needs to be fixed
> > regardless of how long it has been around. Blaming the messenger
> > (userspace, fstests, etc) and saying it should work around a
> > spurious, unpredictable, undesirable and user-undebuggable kernel
> > behaviour is not an acceptable solution here...
> 
> Not sure why you both are putting words in my mouth, I've merely been
> arguing pros and cons and the impact of this. I even linked the io_uring
> addition for ensuring that side will work better once the deferred fput
> is sorted out. I didn't like the idea of fixing this through umount, and
> even outlined how it could be fixed properly by ensuring we flush
> per-task deferred puts on task exit.
> 
> Do I think it's a big issue? Not at all, because a) nobody has reported
> it until now, and b) it's kind of a stupid case. If we can fix it with

Agreed.

> minimal impact, should we? Yep. Particularly as the assumptions stated
> in the original commit I referenced were not even valid back then.

There seems to be a wild misconception here which frankly is very
concerning. Afaik, it is absolutely not the case that an fput() from an
exiting task ends up in delayed_fput(). But I'm happy to be convinced
otherwise.

But thinking about it for more than a second, it would mean that __every
single task__ that passes through do_exit() would punt all its files to
delayed_fput() for closing. __Every single task on the system__.

What sort of DOS vector do people think we built into the kernel's exit
path? Hundreds or thousands of systemd services can have thousands of
fds open and somehow we punt them all to delayed_fput() when they get
killed, shut down, or exit?

do_exit()
-> io_uring_files_cancel() /* can register task work */
-> exit_files()            /* can register task work */
-> exit_task_work()
   -> task_work_run()      /* run queued task work and when done set &work_exited sentinel */

Only after exit_task_work() is called do we need to rely on
delayed_fput(). But all files in the task's fd table will already have
been registered for cleanup in exit_files() via task work if that
task does indeed hold the last reference.

Unless we're in an interrupt context or we're dealing with a
PF_KTHREAD...
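
Concretely, the logic being described is fput() in fs/file_table.c,
which reads roughly like this (trimmed for illustration):

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		/* normal user context: queue the final fput as task
		 * work on the current task */
		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
				return;
			/* task_work_add() only fails once exit_task_work()
			 * has run; fall through to the delayed list */
		}

		/* interrupt or kthread context: punt to the global
		 * delayed_fput list, run off the system workqueue */
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}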

So, generic/388 calls fsstress. If aio and/or io_uring is present
fsstress will be linked against aio/io_uring and will execute these
codepaths in its default invocation.

Compile out the aio and io_uring support, register a kretprobe via
bpftrace and snoop on how many times delayed_fput() is called when the
test is run with however many threads you want: absolutely not a single
time. Every SIGKILLed task goes through do_exit(), exit_files(),
registers its fputs as task work and then calls exit_task_work(), runs
the task work, then disables task work and finally dies peacefully.

Now compile in aio and io_uring support, register a kretprobe via
bpftrace and snoop on how many times delayed_fput() is called and see
frequent delayed_fput() calls for aio and less frequent delayed_fput()
calls for io_uring.

Taking aio as an example, if we SIGKILL the last userspace process with
the aio fd, it will exit and at some point the kernel will hit exit_aio()
on the last __mmput():

do_exit()
-> exit_mm()
   -> mmput() /* mm could be pinned by some crap somewhere */
      -> exit_aio()
-> io_uring_files_cancel() /* can use task work */
-> exit_files()            /* can use task work */
-> exit_task_work()	   /* no more task work after that */
   -> task_work_run()      /* run queued task work and only when done set &work_exited sentinel */

If there are any outstanding io requests that haven't been completed
then aio will cancel them and punt them onto the system workqueue -
which is, surprise, serviced by a PF_KTHREAD. So then delayed_fput() is
hit.

io_uring hits this less frequently, but it does punt work to a kthread
via io_fallback_tw() iirc if the current task is PF_EXITING, and so uses
delayed_fput() in these scenarios.

So it is async io _explicitly_ punting to a kthread for cleanup that's
causing the issues here, and it is an async io problem.

Why is this racy? For this to become a problem, delayed_fput() work
must have been registered, and the delayed_fput() run off the system
workqueue must take longer than the task work run by all of the other
exiting threads.

We give zero f***s about legacy aio - Heidi meme-style. Not a single
line of code in the VFS will be complicated because of this legacy
cruft that everyone hates with a passion. Ultimately it's even why we
ended up with the nice io_uring io_worker model.

And io_uring - Jens can correct me - can probably be improved to rely on
task work even if the task is PF_EXITING, as long as exit_task_work()
hasn't been called for that task, which I reckon it hasn't. So probably
how io_uring cancels work in io_uring_files_cancel() needs to be tweaked
if that really is an issue.

But hard NAK on fiddling with umount for any of this. Umount has never
given and will never give any guarantee that a superblock is gone when
it returns. Even if it succeeds and returns, that doesn't mean that the
superblock has gone away and that the filesystem can be mounted again
fresh.

Bind mounts, mount namespaces, and mount propagation - independent of
someone pinning files in a given mount or kthread-based async io fput
cleanup - make this completely meaningless. We can't guarantee that, and
we will absolutely not get into the business of that in the umount code.

If this generic/388 test is to be reliable right now in the face of
async io, the synchronization method via fsnotify that allows you to get
notified about superblock destruction should be used.
Kent Overstreet June 29, 2023, 2:17 p.m. UTC | #45
On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote:
> There seems to be a wild misconception here which frankly is very
> concering. Afaik, it is absolutely not the case that an fput() from an
> exiting task ends up in delayed_work(). But I'm happy to be convinced
> otherwise.

I already explained the real issue - it's fput() from an AIO completion,
because that has no association with the task it was done on behalf of.
Kent Overstreet June 29, 2023, 3:31 p.m. UTC | #46
On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote:
> On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote:
> > [...]
> > 
> > Not sure why you both are putting words in my mouth, I've merely been
> > arguing pros and cons and the impact of this. I even linked the io_uring
> > addition for ensuring that side will work better once the deferred fput
> > is sorted out. I didn't like the idea of fixing this through umount, and
> > even outlined how it could be fixed properly by ensuring we flush
> > per-task deferred puts on task exit.
> > 
> > Do I think it's a big issue? Not at all, because a) nobody has reported
> > it until now, and b) it's kind of a stupid case. If we can fix it with
> 
> Agreed.

yeah, the rest of this email that I snipped is _severely_ confused about
what is going on here.

Look, the main thing I want to say is - I'm not at all impressed by this
continual evasiveness from you and Jens. It's a bug, it needs to be
fixed.

We are engineers. It is our literal job to do the hard work and solve
the hard problems, and leave behind a system more robust and more
reliable for the people who come after us to use.

Not to kick the can down the line and leave lurking landmines in the
form of "oh you just have to work around this like x..."
Christian Brauner June 30, 2023, 9:40 a.m. UTC | #47
On Thu, Jun 29, 2023 at 11:31:09AM -0400, Kent Overstreet wrote:
> On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote:
> > On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote:
> > > [...]
> > > 
> > > Do I think it's a big issue? Not at all, because a) nobody has reported
> > > it until now, and b) it's kind of a stupid case. If we can fix it with
> > 
> > Agreed.
> 
> yeah, the rest of this email that I snipped is _severely_ confused about
> what is going on here.
> 
> Look, the main thing I want to say is - I'm not at all impressed by this
> continual evasiveness from you and Jens. It's a bug, it needs to be
> fixed.
> 
> We are engineers. It is our literal job to do the hard work and solve
> the hard problems, and leave behind a system more robust and more
> reliable for the people who come after us to use.
> 
> Not to kick the can down the line and leave lurking landmines in the
> form of "oh you just have to work around this like x..."

We're all not very impressed with what's going on here. I think everyone
has made that pretty clear.

It's worrying that this reply is so quickly and happily turning to
"I'm a real engineer" and "you're confused" tropes and then isn't even
making a clear point. Going forward this should stop, otherwise I'll
cease replying.

Nothing I said was confused. The discussion was initially trying to fix
this in umount and we're not going to fix async aio behavior in umount.

My earlier mail clearly said that io_uring can be changed by Jens pretty
quickly to not cause such test failures.

But there's a trade-off to be considered where we have to introduce new
sensitive and complicated file cleanup code for the sake of the legacy
aio api that even the manpage marks as incomplete and buggy. And all for
an issue that was only ever found out in a test and for behavior that's
existed since the dawn of time.

"We're real engineers" is not an argument for that trade off being
sensible.
Kent Overstreet July 6, 2023, 3:20 p.m. UTC | #48
On Fri, Jun 30, 2023 at 11:40:32AM +0200, Christian Brauner wrote:
> We're all not very impressed with that's going on here. I think everyone
> has made that pretty clear.
> 
> It's worrying that this reply is so quickly and happily turning to
> "I'm a real engineer" and "you're confused" tropes and then isn't even
> making a clear point. Going forward this should stop otherwise I'll
> cease replying.
>
> Nothing I said was confused. The discussion was initially trying to fix
> this in umount and we're not going to fix async aio behavior in umount.

Christian, why on earth would we be trying to fix this in umount? All
you posted was a stack trace and something handwavy about how fixing it
in umount would be hard, and yes it would be! That's crazy!

This is a basic lifetime issue, where we just need to make sure that
refcounts are getting released at the appropriate place and not being
delayed for arbitrarily long (i.e. the global delayed fput list, which
honestly we should probably try to get rid of).

Furthermore, when issues with fput have caused umount to fail in the
past it's always been considered a bug - see the addition of
__fput_sync(); if you do some searching you should be able to find
multiple patches where this has been dealt with.
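
(__fput_sync() itself is tiny - roughly, as it reads today:

void __fput_sync(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		/* kernel threads only: drop the final reference
		 * synchronously, with no detour through task work or
		 * the global delayed list */
		BUG_ON(!(current->flags & PF_KTHREAD));
		__fput(file);
	}
}

the whole point being that the final reference drop happens
synchronously in the caller.)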

> My earlier mail clearly said that io_uring can be changed by Jens pretty
> quickly to not cause such test failures.

Jens posted a fix that didn't actually fix anything, and after that it
seemed neither of you were interested in actually fixing this. So based
on that, maybe we need to consider switching fstests back to AIO just so
we can get work done...
Kent Overstreet July 6, 2023, 3:56 p.m. UTC | #49
On Mon, Jun 26, 2023 at 05:47:01PM -0400, Kent Overstreet wrote:
> [...]

Restarting the discussion after the holiday weekend, hoping to get
something more substantive going:

Hoping to get:
 - Thoughts from people who have been following bcachefs development,
   and people who have looked at the code
 - Continuation of the LSF discussion - maybe some people could repeat
   here what they said there (re: code review, iomap, etc.)
 - Any concerns about how this might impact the rest of the kernel, or
   discussion about what impact merging a new filesystem is likely to
   have on other people's work

AFAIK the only big ask that hasn't happened yet is better documentation:
David Howells wanted a (better) man page, which is definitely something
that needs to happen, but it'll be some months before I'm back to working
on documentation - I'm happy to share my current list of priorities if
that would be helpful.

In the meantime, the LaTeX principles of operation document is
reasonably up to date (and I intend to greatly expand the sections on
on-disk data structures; I think that'll be great reference
documentation for developers getting up to speed on the code)

https://bcachefs.org/bcachefs-principles-of-operation.pdf

I feel that bcachefs is in a pretty mature state at this point, but it's
also _huge_, which is a bit different than e.g. the btrfs merge; it's
hard to know where to start to get a meaningful discussion/review process
going.

Patch bombing the mailing list with 90k loc is clearly not going to be
productive, which is why I've been trying to talk more about development
process and status - but all suggestions and feedback are welcome.

Cheers,
Kent
Jens Axboe July 6, 2023, 4:26 p.m. UTC | #50
On 7/6/23 9:20 AM, Kent Overstreet wrote:
>> My earlier mail clearly said that io_uring can be changed by Jens pretty
>> quickly to not cause such test failures.
> 
> Jens posted a fix that didn't actually fix anything, and after that it
> seemed neither of you were interested in actually fixing this. So
> based on that, maybe we need to consider switching fstests back to AIO
> just so we can get work done...

Yeah let's keep misrepresenting... I already showed how to hit this
easily with aio, and you said you'd fix aio. But nothing really happened
there, unsurprisingly.

You do what you want, as per usual these threads just turn into an
unproductive (and waste of time) shit show. Muted on my end from now on.
Kent Overstreet July 6, 2023, 4:34 p.m. UTC | #51
On Thu, Jul 06, 2023 at 10:26:34AM -0600, Jens Axboe wrote:
> On 7/6/23 9:20 AM, Kent Overstreet wrote:
> >> My earlier mail clearly said that io_uring can be changed by Jens pretty
> >> quickly to not cause such test failures.
> > 
> > Jens posted a fix that didn't actually fix anything, and after that it
> > seemed neither of you were interested in actually fixing this. So
> > based on that, maybe we need to consider switching fstests back to AIO
> > just so we can get work done...
> 
> Yeah let's keep misrepresenting... I already showed how to hit this
> easily with aio, and you said you'd fix aio. But nothing really happened
> there, unsurprisingly.

Jens, your test case showing that this happens on aio too was
appreciated: I was out of town for the holiday weekend, and I'm just now
back home catching up and fixing your test case is the first thing I'm
working on.

But like I said, this wasn't causing test failures when we were using
AIO, it's only since we switched to io_uring that this has become an
issue, and I'm not the only one telling you this is an issue, so the
ball is very much in your court.

> You do what you want, as per usual these threads just turn into an
> unproductive (and waste of time) shit show. Muted on my end from now on.

Ok.
Josef Bacik July 6, 2023, 4:40 p.m. UTC | #52
On Thu, Jul 06, 2023 at 11:56:02AM -0400, Kent Overstreet wrote:
> On Mon, Jun 26, 2023 at 05:47:01PM -0400, Kent Overstreet wrote:
> > [...]
> 
> Restarting the discussion after the holiday weekend, hoping to get
> something more substantive going:
> 
> Hoping to get:
>  - Thoughts from people who have been following bcachefs development,
>    and people who have looked at the code
>  - Continuation of the LSF discussion - maybe some people could repeat
>    here what they said there (re: code review, iomap, etc.)
>  - Any concerns about how this might impact the rest of the kernel, or
>    discussion about what impact merging a new filesystem is likely to
>    have on other people's work
> 
> AFAIK the only big ask that hasn't happened yet is better documentation:
> David Howells wanted a (better) man page, which is definitely something
> that needs to happen, but it'll be some months before I'm back to working
> on documentation - I'm happy to share my current list of priorities if
> that would be helpful.
> 
> In the meantime, the LaTeX principles of operation document is
> reasonably up to date (and I intend to greatly expand the sections on
> on-disk data structures; I think that'll be great reference
> documentation for developers getting up to speed on the code)
> 
> https://bcachefs.org/bcachefs-principles-of-operation.pdf
> 
> I feel that bcachefs is in a pretty mature state at this point, but it's
> also _huge_, which is a bit different than e.g. the btrfs merge; it's
> hard to know where to start to get a meaningful discussion/review process
> going.
> 
> Patch bombing the mailing list with 90k loc is clearly not going to be
> productive, which is why I've been trying to talk more about development
> process and status - but all suggestions and feedback are welcome.

I've been watching this from the sidelines sort of busy with other things, but I
realize that comments I made at LSFMMBPF have been sort of taken as the gospel
truth and I want to clear some of that up.

I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it
here.

I'm of the opinion that me and any other outsider reviewing the bcachefs code in
bulk is largely useless.  I could probably do things like check for locking
stuff and other generic things.

You have patches that are outside of fs/bcachefs.  Get those merged and then do
a pull with just fs/bcachefs, because again posting 90k loc is going to be
unwieldy and the quality of review just simply will not make a difference.

Alternatively rework your code to not have any dependencies outside of
fs/bcachefs.  This is what btrfs did.  That merge didn't touch anything outside
of fs/btrfs.

This merge attempt has gone off the rails, for what appears to be a few common
things.

1) The external dependencies.  There's a reason I was really specific about what
I said at LSFMMBPF, both this year and in 2022.  Get these patches merged first,
the rest will be easier.  You are burning a lot of good will being combative
with people over these dependencies.  This is not the hill to die on.  You want
bcachefs in the kernel and to get back to bcachefs things.  Make the changes you
need to make to get these dependencies in, or simply drop the need for them and
come back to it later after bcachefs is merged.

2) We already have recent examples of merge and disappear.  Yes of course you've
been around for a long time, you aren't the NTFS developers.  But as you point
out it's 90k of code.  When btrfs was merged there were 3 large contributors,
Chris, myself, and Yanzheng.  If Chris got hit by a bus we could still drive the
project forward.  Can the same be said for bcachefs?  I know others have chimed
in and done some stuff, but as it's been stated elsewhere it would be good to
have somebody else in the MAINTAINERS file with you.

I am really, really wanting you to succeed here Kent.  If the general consensus
is you need to have some idiot review fs/bcachefs I will happily carve out some
time and dig in.

At this point however it's time to be pragmatic.  Stop dying on every hill, it's
not worth it.  Ruthlessly prioritize and do what needs to be done to get this
thing merged.  Christian saying he's almost ready to stop replying should be a
wakeup call that your approach is not working.  Thanks,

Josef
Kent Overstreet July 6, 2023, 5:38 p.m. UTC | #53
On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote:
> I've been watching this from the sidelines sort of busy with other things, but I
> realize that comments I made at LSFMMBPF have been sort of taken as the gospel
> truth and I want to clear some of that up.
> 
> I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it
> here.
> 
> I'm of the opinion that me and any other outsider reviewing the bcachefs code in
> bulk is largely useless.  I could probably do things like check for locking
> stuff and other generic things.

Yeah, agreed. And the generic things - that's what we've got automated
testing for; there's a reason I've been putting so much effort into
automated testing over (especially) the past year.

> You have patches that are outside of fs/bcachefs.  Get those merged and then do
> a pull with just fs/bcachefs, because again posting 90k loc is going to be
> unwieldy and the quality of review just simply will not make a difference.
>
> Alternatively rework your code to not have any dependencies outside of
> fs/bcachefs.  This is what btrfs did.  That merge didn't touch anything outside
> of fs/btrfs.

We've had other people saying, at multiple times in the past, that
patches that are only needed for bcachefs should be part of the initial
pull instead of going in separately.

I've already cut down the non-bcachefs pull quite a bit, even to the
point of making non-ideal engineering choices, and if I have to cut it
down more it's going to mean more ugly choices.

> This merge attempt has gone off the rails, for what appears to be a few common
> things.
> 
> 1) The external dependencies.  There's a reason I was really specific about what
> I said at LSFMMBPF, both this year and in 2022.  Get these patches merged first,
> the rest will be easier.  You are burning a lot of good will being combative
> with people over these dependencies.  This is not the hill to die on.  You want
> bcachefs in the kernel and to get back to bcachefs things.  Make the changes you
> need to make to get these dependencies in, or simply drop the need for them and
> come back to it later after bcachefs is merged.

Look, I'm not at all trying to be combative, I'm just trying to push
things forward.

The one trainwreck-y thread was regarding vmalloc_exec(), and posting
that patch needed to happen in order to figure out what was even going
to happen regarding the dynamic codegen going forward. It's been dropped
from the initial pull, and dynamic codegen is going to wait on a better
executable memory allocation API.

(and yes, that thread _was_ a trainwreck; it's not good when you have
maintainers claiming endlessly that something is broken and making
arguments from authority but _not being able to explain why_. The thread on the
new executable memory allocator still needs something more concrete on
the issues with speculative execution from Andy or someone else).

Let me just lay out the non-bcachefs dependencies:

 - two lockdep patches: these could technically be dropped from the
   series, but that would mean dropping lockdep support entirely for
   btree node locks, and even Linus has said we need to get rid of
   lockdep_no_validate_class so I'm hoping to avoid that.

 - six locks: this shouldn't be blocking, we can move them to
   fs/bcachefs/ if Linus still feels they need more review - but Dave
   Chinner was wanting them and the locking people disliked exporting
   osq_lock so that's why I have them in kernel/locking.

 - mean_and_variance: this is some statistics library code that computes
   mean and standard deviation for time samples, both standard and
   exponentially weighted; a rough sketch of the computation follows this
   list. Again, bcachefs is the first user so this pull request is the
   proper place for this code, and I'm intending to convert bcache to
   this code as well as use it for other kernel wide latency tracking
   (which I demoed at LSF awhile back; I'll be posting it again once code
   tagging is upstreamed as part of the memory allocation work Suren and
   I are doing).

 - task_struct->faults_disabled_mapping: this adds a task_struct member
   that makes it possible to do strict page cache coherency.

   This is something I intend to push into the VFS, but it's going to be
   a big project - it needs a new type of lock (the one in bcachefs is
   good enough for an initial implementation, but the real version
   probably needs priority inheritance and other things). In the
   meantime, I've thoroughly documented what's going on and what the
   plan is in the commit message.

 - d_mark_tmpfile(): trivial new helper, from pulling out part of
   d_tmpfile(). We need this because bcachefs handles the nlink count
   for tmpfiles earlier, in the btree transaction.

 - copy_folio_from_iter_atomic(): obvious new helper, other filesystems
   will want this at some point as part of the ongoing folio conversion

 - block layer patches: we have

   - new exports: primarily because bcachefs has its own dio path and
     does not use iomap, also blk_status_to_str() for better error
     messages

   - bio_iov_iter_get_pages() with bio->bi_bdev unset: bcachefs builds
     up bios before we know which block device those bios will be
     issued to.

     There was something thrown out about "bi_bdev being required" - but
     that doesn't make a lot of sense here. The direction in the block
     layer back when I made it sane for stacking block drivers - i.e.
     enabling efficient splitting/cloning of bios - was towards bios
     being more just simple iterators over a scatter/gather list, and
     now we've got iov_iter which can point at a bio/bvec array - moving
     even more in that direction.

     Regardless, this patch is pretty trivial, it's not something that
     commits us to one particular approach. bio_iov_iter_get_pages() is
     here trying to return bios that are aligned to the block device's
     blocksize, but in our case we just want it aligned to the
     filesystem's blocksize.

   - bring back zero_fill_bio_iter() - I originally wrote this,
     Christoph deleted it without checking. It's just a more general
     version of zero_fill_bio().
 
   - Don't block on s_umount from __invalidate_super: this is a bugfix
     for a deadlock in generic/042 because of how we use sget(), the
     commit message goes into more detail.

     bcachefs doesn't use sget() for mutual exclusion because a) sget()
     is insane; what we really want is the _block device_ to be opened
     exclusively (which we do), and we have our own block device opening
     path - which we need, as we're a multi-device filesystem.

 - generic radix tree fixes: these are just fixes for code I already wrote
   for bcachefs and upstreamed previously, after converting existing
   users of flex-array.

 - move closures to lib/ - this is also code I wrote, now needs to be
   shared with bcache

 - small stuff:
   - export stack_trace_save_tsk() - this is used for displaying stack
     traces in debugfs
   - export errname() - better error messages
   - string_get_size() - change it to return number of characters written
   - compiler attributes - add __flatten
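
To give a flavor of what the mean_and_variance code computes - this is a
rough sketch of a Welford-style running update in integer math, with
made-up names, not the module's actual API:

struct mv_sketch {
	u64	n;	/* number of samples so far */
	s64	mean;	/* running mean */
	u64	m2;	/* sum of squared deviations from the mean */
};

static void mv_sketch_update(struct mv_sketch *s, s64 sample)
{
	s64 d1, d2;

	s->n++;
	d1 = sample - s->mean;			/* deviation from old mean */
	s->mean += div64_s64(d1, (s64) s->n);	/* incremental mean update */
	d2 = sample - s->mean;			/* deviation from new mean */
	s->m2 += (u64) (d1 * d2);		/* variance accumulator */
}

Variance is then m2 / n; the exponentially weighted variant replaces the
1/n step with a constant weight so that old samples decay.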

If there are objections to any of these patches, please _be specific_.
Please remember that I am also incorporating feedback previous
discussions, and a generic "these patches need to go in separately" is
not something I can do anything with, as explained previously.

> 2) We already have recent examples of merge and disappear.  Yes of course you've
> been around for a long time, you aren't the NTFS developers.  But as you point
> out it's 90k of code.  When btrfs was merged there were 3 large contributors,
> Chris, myself, and Yanzheng.  If Chris got hit by a bus we could still drive the
> project forward.  Can the same be said for bcachefs?  I know others have chimed
> in and done some stuff, but as it's been stated elsewhere it would be good to
> have somebody else in the MAINTAINERS file with you.

Yes, the bcachefs project needs to grow in terms of developers. The
unfortunate reality is that right now is a _hard_ time to be growing teams
and budgets in this area; it's been an uphill battle.

You, the btrfs developers, got started when Linux filesystem teams were
quite a bit bigger than they are now: I was at Google when Google had a
bunch of people working on ext4, and that was when ZFS had recently come
out and there was recognition that Linux needed an answer to ZFS - you
were able to ride that excitement. It's been a bit harder for me to get
something equally ambitious going, to be honest.

But years ago when I realized I was onto something, I decided this
project was only going to fail if I let it fail - so I'm in it for the
long haul.

Right now what I'm hearing, in particular from Red Hat, is that they want
it upstream in order to commit more resources. Which, I know, is not
what kernel people want to hear, but it's the chicken-and-the-egg
situation I'm in.

> I am really, really wanting you to succeed here Kent.  If the general consensus
> is you need to have some idiot review fs/bcachefs I will happily carve out some
> time and dig in.

That would be much appreciated - I'll owe you some beers next time I see
you. But before jumping in, let's see if we can get people who have
already worked with the code to say something.

Something I've done in the past that might be helpful - instead of (or
in addition to) having people read code in isolation, perhaps we could
do a
group call/meeting where people can ask questions about the code, bring
up design issues they've seen in other filesystems, etc. - I've also
found that kind of setup great for identifying places in the code where
additional documentation would be useful.
Eric Sandeen July 6, 2023, 7:17 p.m. UTC | #54
On 7/6/23 12:38 PM, Kent Overstreet wrote:
> Right now what I'm hearing, in particular from Red Hat, is that they want
> it upstream in order to commit more resources. Which, I know, is not
> what kernel people want to hear, but it's the chicken-and-the-egg
> situation I'm in.

I need to temper that a little. Folks in and around filesystems and 
storage at Red Hat find bcachefs to be quite compelling and interesting, 
and we've spent some resources in the past several months to 
investigate, test, benchmark, and even do some bugfixing.

Upstream acceptance is going to be a necessary condition for almost any 
distro to consider shipping or investing significantly in bcachefs. But 
it's not a given that once it's upstream we'll immediately commit more 
resources - I just wanted to clarify that.

It is a tough chicken and egg problem to be sure. That said, I think 
you're right Kent - landing it upstream will quite likely encourage more 
interest, users, and hopefully developers.

Maybe it'd be reasonable to mark bcachefs as EXPERIMENTAL or similar in 
Kconfig, documentation, and printks - it'd give us options in case it 
doesn't attract developers and Kent does get hit by a bus or decides to 
go start a goat farm instead (i.e. in the worst case, it could be 
yanked, having set expectations.)

-Eric
Kent Overstreet July 6, 2023, 7:31 p.m. UTC | #55
On Thu, Jul 06, 2023 at 02:17:29PM -0500, Eric Sandeen wrote:
> On 7/6/23 12:38 PM, Kent Overstreet wrote:
> > Right now what I'm hearing, in particular from Red Hat, is that they want
> > it upstream in order to commit more resources. Which, I know, is not
> > what kernel people want to hear, but it's the chicken-and-the-egg
> > situation I'm in.
> 
> I need to temper that a little. Folks in and around filesystems and storage
> at Red Hat find bcachefs to be quite compelling and interesting, and we've
> spent some resources in the past several months to investigate, test,
> benchmark, and even do some bugfixing.
> 
> Upstream acceptance is going to be a necessary condition for almost any
> distro to consider shipping or investing significantly in bcachefs. But it's
> not a given that once it's upstream we'll immediately commit more resources
> - I just wanted to clarify that.

Yeah, I should probably have worded that a bit better. But in the
conversations I've had with people at other companies it does sound like
the interest is there; it's just that filesystem/storage teams are not
so big these days as to support investing in something that is not yet
mainlined.

> It is a tough chicken and egg problem to be sure. That said, I think you're
> right Kent - landing it upstream will quite likely encourage more interest,
> users, and hopefully developers.

Gotta start somewhere :)
 
> Maybe it'd be reasonable to mark bcachefs as EXPERIMENTAL or similar in
> Kconfig, documentation, and printks - it'd give us options in case it
> doesn't attract developers and Kent does get hit by a bus or decide to go
> start a goat farm instead (i.e. in the worst case, it could be yanked,
> having set expectations.)

Yeah, it does need to be marked EXPERIMENTAL initially, regardless -
staged rollout please, not everyone all at once :)
Kent Overstreet July 6, 2023, 8:15 p.m. UTC | #56
On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
> On 6/28/23 2:44 PM, Jens Axboe wrote:
> > On 6/28/23 11:52 AM, Kent Overstreet wrote:
> >> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
> >>> I discussed this with Christian offline. I have a patch that is pretty
> >>> simple, but it does mean that you'd wait for delayed fput flush off
> >>> umount. Which seems kind of iffy.
> >>>
> >>> I think we need to back up a bit and consider if the kill && umount
> >>> really is sane. If you kill a task that has open files, then any fput
> >>> from that task will end up being delayed. This means that the umount may
> >>> very well fail.
> >>>
> >>> It'd be handy if we could have umount wait for that to finish, but I'm
> >>> not at all confident this is a sane solution for all cases. And as
> >>> discussed, we have no way to even identify which files we'd need to
> >>> flush out of the delayed list.
> >>>
> >>> Maybe the test case just needs fixing? Christian suggested lazy/detach
> >>> umount and wait for sb release. There's an fsnotify hook for that,
> >>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
> >>> to me that this would be the way to make it more reliable when killing
> >>> of tasks with open files are involved.
> >>
> >> No, this is a real breakage. Any time we introduce unexpected
> >> asynchrony there's the potential for breakage: case in point, there was
> >> a filesystem that made rm asynchronous, then there were scripts out
> >> there that deleted until df showed under some threshold.. whoops...
> > 
> > This is nothing new - any fput done from an exiting task will end up
> > being deferred. The window may be a bit wider now or a bit different,
> > but it's the same window. If an application assumes it can kill && wait
> > on a task and be guaranteed that the files are released as soon as wait
> > returns, it is mistaken. That is NOT the case.
> 
> Case in point, just changed my reproducer to use aio instead of
> io_uring. Here's the full script:
> 
> #!/bin/bash
> 
> DEV=/dev/nvme1n1
> MNT=/data
> ITER=0
> 
> while true; do
> 	echo loop $ITER
> 	sudo mount $DEV $MNT
> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
> 	Y=$(($RANDOM % 3))
> 	X=$(($RANDOM % 10))
> 	VAL="$Y.$X"
> 	sleep $VAL
> 	ps -e | grep fio > /dev/null 2>&1
> 	while [ $? -eq 0 ]; do
> 		killall -9 fio > /dev/null 2>&1
> 		echo will wait
> 		wait > /dev/null 2>&1
> 		echo done waiting
> 		ps -e | grep "fio " > /dev/null 2>&1
> 	done
> 	sudo umount /data
> 	if [ $? -ne 0 ]; then
> 		break
> 	fi
> 	((ITER++))
> done
> 
> and if I run that, it fails on the first umount attempt in that loop:
> 
> axboe@m1max-kvm ~> bash test2.sh
> loop 0
> will wait
> done waiting
> umount: /data: target is busy.

Your test fails because fio by default spawns off multiple processes,
and just calling wait does not wait for the subprocesses.

When I pass --thread to fio, your test passes.

I have a patch to avoid use of the delayed_fput list in the aio path,
but curiously it seems not to be needed - perhaps there's some other
synchronization I haven't found yet. I'm including the patch below in
case the technique is useful for io_uring:

diff --git a/fs/aio.c b/fs/aio.c
index b3e14a9fe3..00cb953efa 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -211,6 +211,7 @@ struct aio_kiocb {
 						 * for cancellation */
 	refcount_t		ki_refcnt;
 
+	struct task_struct	*ki_task;
 	/*
 	 * If the aio_resfd field of the userspace iocb is not zero,
 	 * this is the underlying eventfd context to deliver events to.
@@ -321,7 +322,7 @@ static void put_aio_ring_file(struct kioctx *ctx)
 		ctx->aio_ring_file = NULL;
 		spin_unlock(&i_mapping->private_lock);
 
-		fput(aio_ring_file);
+		__fput_sync(aio_ring_file);
 	}
 }
 
@@ -1068,6 +1069,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 	INIT_LIST_HEAD(&req->ki_list);
 	refcount_set(&req->ki_refcnt, 2);
 	req->ki_eventfd = NULL;
+	req->ki_task = get_task_struct(current);
 	return req;
 }
 
@@ -1104,8 +1106,9 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
 	if (iocb->ki_eventfd)
 		eventfd_ctx_put(iocb->ki_eventfd);
 	if (iocb->ki_filp)
-		fput(iocb->ki_filp);
+		fput_for_task(iocb->ki_filp, iocb->ki_task);
 	percpu_ref_put(&iocb->ki_ctx->reqs);
+	put_task_struct(iocb->ki_task);
 	kmem_cache_free(kiocb_cachep, iocb);
 }
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 372653b926..137f87f55e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -367,12 +367,13 @@ EXPORT_SYMBOL_GPL(flush_delayed_fput);
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_for_task(struct file *file, struct task_struct *task)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
-		struct task_struct *task = current;
+		if (!task && likely(!in_interrupt() && !(current->flags & PF_KTHREAD)))
+			task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (task) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
@@ -388,6 +389,11 @@ void fput(struct file *file)
 	}
 }
 
+void fput(struct file *file)
+{
+	fput_for_task(file, NULL);
+}
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
@@ -405,6 +411,7 @@ void __fput_sync(struct file *file)
 	}
 }
 
+EXPORT_SYMBOL(fput_for_task);
 EXPORT_SYMBOL(fput);
 EXPORT_SYMBOL(__fput_sync);
 
diff --git a/include/linux/file.h b/include/linux/file.h
index 39704eae83..667a68f477 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -12,7 +12,9 @@
 #include <linux/errno.h>
 
 struct file;
+struct task_struct;
 
+extern void fput_for_task(struct file *, struct task_struct *);
 extern void fput(struct file *);
 
 struct file_operations;
Darrick J. Wong July 6, 2023, 9:19 p.m. UTC | #57
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote:
> On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote:
> > I've been watching this from the sidelines sort of busy with other things, but I
> > realize that comments I made at LSFMMBPF have been sort of taken as the gospel
> > truth and I want to clear some of that up.
> > 
> > I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it
> > here.
> > 
> > I'm of the opinion that me and any other outsider reviewing the bcachefs code in
> > bulk is largely useless.  I could probably do things like check for locking
> > stuff and other generic things.
> 
> Yeah, agreed. And the generic things - that's what we've got automated
> testing for; there's a reason I've been putting so much effort into
> automated testing over (especially) the past year.

Woot.  That's more than I can say for ntfs3...

> > You have patches that are outside of fs/bcachefs.  Get those merged and then do
> > a pull with just fs/bcachefs, because again posting 90k loc is going to be
> > unwieldy and the quality of review just simply will not make a difference.
> >
> > Alternatively rework your code to not have any dependencies outside of
> > fs/bcachefs.  This is what btrfs did.  That merge didn't touch anything outside
> > of fs/btrfs.
> 
> We've had other people saying, at multiple times in the past, that
> patches that are only needed for bcachefs should be part of the initial
> pull instead of going in separately.
> 
> I've already cut down the non-bcachefs pull quite a bit, even to the
> point of making non-ideal engineering choices, and if I have to cut it
> down more it's going to mean more ugly choices.

<nod>

> > This merge attempt has gone off the rails, for what appears to be a few common
> > things.
> > 
> > 1) The external dependencies.  There's a reason I was really specific about what
> > I said at LSFMMBPF, both this year and in 2022.  Get these patches merged first,
> > the rest will be easier.  You are burning a lot of good will being combative
> > with people over these dependencies.  This is not the hill to die on.  You want
> > bcachefs in the kernel and to get back to bcachefs things.  Make the changes you
> > need to make to get these dependencies in, or simply drop the need for them and
> > come back to it later after bcachefs is merged.
> 
> Look, I'm not at all trying to be combative, I'm just trying to push
> things forward.
> 
> The one trainwreck-y thread was regarding vmalloc_exec(), and posting
> that patch needed to happen in order to figure out what was even going
> to happen regarding the dynamic codegen going forward. It's been dropped
> from the initial pull, and dynamic codegen is going to wait on a better
> executable memory allocation API.
> 
> (and yes, that thread _was_ a trainwreck; it's not good when you have
> maintainers claiming endlessly that something is broken and making
> arguments from authority but _not being able to explain why_. The thread on the
> new executable memory allocator still needs something more concrete on
> the issues with speculative execution from Andy or someone else).

(Honestly I'm glad that's set aside for now, because it seemed like a
giant can of worms for a non-critical optimization.)

> Let me just lay out the non-bcachefs dependencies:
> 
>  - two lockdep patches: these could technically be dropped from the
>    series, but that would mean dropping lockdep support entirely for
>    btree node locks, and even Linus has said we need to get rid of
>    lockdep_no_validate_class so I'm hoping to avoid that.
> 
>  - six locks: this shouldn't be blocking, we can move them to
>    fs/bcachefs/ if Linus still feels they need more review - but Dave
>    Chinner was wanting them and the locking people disliked exporting
>    osq_lock so that's why I have them in kernel/locking.

That's probably ok for now; we don't (AFAIK) have any concrete plan for
deploying sixlocks in xfs at this time.  Even if we did, it's still
going to take another 15 years to review the ~2000 patches of backlog in
djwong/dchinner trees.

>  - mean_and_variance: this is some statistics library code that computes
>    mean and standard deviation for time samples, both standard and
>    exponentially weighted. Again, bcachefs is the first user so this
>    pull request is the proper place for this code, and I'm intending to
>    convert bcache to this code as well as use it for other kernel-wide
>    latency tracking (which I demoed at LSF a while back; I'll be posting
>    it again once code tagging is upstreamed as part of the memory
>    allocation work Suren and I are doing).

TBH, so long as bcachefs is the only user of sixlocks and mean/variance,
I don't really care what's in them, though they probably ought to live
in fs/bcachefs/ until a second user arises.
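
For anyone unfamiliar with the technique: an exponentially weighted
mean and variance can be maintained incrementally, O(1) per sample. A
minimal sketch in plain C - illustrative only, not the actual
mean_and_variance API - using a power-of-two weight so everything stays
in integer arithmetic:

	struct ewma_stats {
		s64	mean;		/* current weighted mean */
		s64	variance;	/* current weighted variance */
		unsigned weight_shift;	/* weight = 1 / 2^weight_shift */
	};

	static void ewma_update(struct ewma_stats *s, s64 sample)
	{
		s64 d = sample - s->mean;

		/* mean += weight * (sample - mean) */
		s->mean += d >> s->weight_shift;
		/* variance decays towards d^2 at the same rate;
		 * overflow handling omitted for brevity */
		s->variance += (d * d - s->variance) >> s->weight_shift;
	}

The non-weighted version is just the usual running sum and sum of
squares, kept in fixed point.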

>  - task_struct->faults_disabled_mapping: this adds a task_struct member
>    that makes it possible to do strict page cache coherency.
> 
>    This is something I intend to push into the VFS, but it's going to be
>    a big project - it needs a new type of lock (the one in bcachefs is
>    good enough for an initial implementation, but the real version
>    probably needs priority inheritance and other things). In the
>    meantime, I've thoroughly documented what's going on and what the
>    plan is in the commit message.
> 
>  - d_mark_tmpfile(): trivial new helper, from pulling out part of
>    d_tmpfile(). We need this because bcachefs handles the nlink count
>    for tmpfiles earlier, in the btree transaction.

XFS might want this too; we also handle the nlink count for tmpfiles
earlier, in a transaction, and end up playing stupid games with the
refcount to fit the vfs function:

	if (tmpfile) {
		/*
		 * The VFS requires that any inode fed to d_tmpfile must
		 * have nlink == 1 so that it can decrement the nlink in
		 * d_tmpfile.  However, we created the temp file with
		 * nlink == 0 because we're not allowed to put an inode
		 * with nlink > 0 on the unlinked list.  Therefore we
		 * have to set nlink to 1 so that d_tmpfile can
		 * immediately set it back to zero.
		 */
		set_nlink(inode, 1);
		d_tmpfile(tmpfile, inode);
	}
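
For readers following along: the factoring splits the nlink decrement
out of d_tmpfile() so that a filesystem which has already accounted for
the link count can call just the dentry-naming half. A sketch of the
shape of the change - 6.4-era signatures, from memory, so treat it as
illustrative rather than the literal patch:

	void d_tmpfile(struct file *file, struct inode *inode)
	{
		inode_dec_link_count(inode);	/* nlink 1 -> 0, as before */
		d_mark_tmpfile(file, inode);	/* just the "#<ino>" dentry naming */
	}

With d_mark_tmpfile() exported, the set_nlink() dance above becomes
unnecessary.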

>  - copy_folio_from_iter_atomic(): obvious new helper, other filesystems
>    will want this at some point as part of the ongoing folio conversion
> 
>  - block layer patches: we have
> 
>    - new exports: primarily because bcachefs has its own dio path and
>      does not use iomap, also blk_status_to_str() for better error
>      messages
> 
>    - bio_iov_iter_get_pages() with bio->bi_bdev unset: bcachefs builds
>      up bios before we know which block device those bios will be
>      issued to.
> 
>      There was something thrown out about "bi_bdev being required" - but
>      that doesn't make a lot of sense here. The direction in the block
>      layer back when I made it sane for stacking block drivers - i.e.
>      enabling efficient splitting/cloning of bios - was towards bios
>      becoming simple iterators over a scatter/gather list, and now
>      we've got iov_iter, which can point at a bio/bvec array - moving
>      even more in that direction.
> 
>      Regardless, this patch is pretty trivial; it's not something that
>      commits us to one particular approach. bio_iov_iter_get_pages()
>      currently tries to return bios that are aligned to the block
>      device's blocksize, but in our case we just want them aligned to
>      the filesystem's blocksize.

<shrug> seems fine to me...
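
Conceptually the change is small: only enforce the device's logical
block size when a device is actually attached - as an illustration of
the idea, not the literal diff:

	static unsigned bio_align_mask(const struct bio *bio)
	{
		/* with bi_bdev unset, alignment is the caller's - i.e.
		 * the filesystem's - responsibility */
		return bio->bi_bdev
			? bdev_logical_block_size(bio->bi_bdev) - 1
			: 0;
	}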

>    - bring back zero_fill_bio_iter() - I originally wrote this,
>      Christoph deleted it without checking. It's just a more general
>      version of zero_fill_bio().
>  
>    - Don't block on s_umount from __invalidate_super: this is a bugfix
>      for a deadlock in generic/042 because of how we use sget(), the
>      commit message goes into more detail.

If this is in reference to the earlier subthread about some io_uring
thing causing unmount to hang -- my impressions of that were that yes,
it's a bug, but no, it's not a bug in bcachefs itself.  I also wondered
why (a) that hadn't been split out as its own thread; and (b) is this
really a bcachefs blocker?

/me shrugs, been on vacation and in hospitals for the last month or so.

>      bcachefs doesn't use sget() for mutual exclusion because a) sget()
>      is insane, what we really want is the _block device_ to be opened
>      exclusively (which we do), and we have our own block device opening
>      path - which we need to, as we're a multi device filesystem.

...and isn't jan kara already messing with this anyway?

>  - generic radix tree fixes: this is just fixes for code I already wrote
>    for bcachefs and upstreamed previously, after converting existing
>    users of flex-array.
> 
>  - move closures to lib/ - this is also code I wrote; it now needs to
>    be shared with bcache

<nod>

>  - small stuff:
>    - export stack_trace_save_tsk() - this is used for displaying stack
>      traces in debugfs
>    - export errname() - better error messages
>    - string_get_size() - change it to return number of characters written
>    - compiler attributes - add __flatten
> 
> If there are objections to any of these patches, please _be specific_.
> Please remember that I am also incorporating feedback from previous
> discussions, and a generic "these patches need to go in separately" is
> not something I can do anything with, as explained previously.
> 
> > 2) We already have recent examples of merge and disappear.  Yes of course you've
> > been around for a long time, you aren't the NTFS developers.  But as you point
> > out it's 90k of code.  When btrfs was merged there were 3 large contributors,
> > Chris, myself, and Yanzheng.  If Chris got hit by a bus we could still drive the
> > project forward.  Can the same be said for bcachefs?

The same can't even be said about ext4 or xfs -- if Ted, Dave, or I
disappeared tomorrow, I predict there would be huge problems within a
month or two.

I'm of two minds here -- I want to say that bcachefs should get merged
because wasting Kent's mind on rebasing out-of-tree patchsets is totally
stupid and I think I've worked over his QA/CI system enough to trust
that bcachefs isn't a giant nightmare codebase.

OTOH there's so many layers of tiny helper functions and macros that I
have a really hard time making sense of all those pre-bcachefs changes.
That's why I haven't weighed in.  Given all the weird problems we've had
recently with new code colliding badly with under-documented optimized
core code, I'm fearful of touching anything.

> > I know others have chimed
> > in and done some stuff, but as it's been stated elsewhere it would be good to
> > have somebody else in the MAINTAINERS file with you.
> 
> Yes, the bcachefs project needs to grow in terms of developers. The
> unfortunate reality is that right now is a _hard_ time to grow teams
> and budgets in this area; it's been an uphill battle.

Same here.  Sillycon Valley just laid off what, like 300,000 engineers
so they could refocus on "AI" but they can't pay for 30 more people to
work on filesystems??</rant>

> You, the btrfs developers, got started when Linux filesystem teams were
> quite a bit bigger than they are now: I was at Google when Google had a
> bunch of people working on ext4, and that was when ZFS had recently come
> out and there was recognition that Linux needed an answer to ZFS and you
> were able to ride that excitement. It's been a bit harder for me to get
> something equally ambitious going, to be honest.
> 
> But years ago when I realized I was onto something, I decided this
> project was only going to fail if I let it fail - so I'm in it for the
> long haul.
> 
> Right now what I'm hearing, in particular from Redhat, is that they want
> it upstream in order to commit more resources. Which, I know, is not
> what kernel people want to hear, but it's the chicken-and-the-egg
> situation I'm in.

/me suspects mainline merging is necessary but not sufficient -- few
non-developers want to deal with merging an out of tree filesystem, but
that still doesn't mean anyone will commit real engineering resources.

> > I am really, really wanting you to succeed here Kent.  If the general consensus
> > is you need to have some idiot review fs/bcachefs I will happily carve out some
> > time and dig in.
> 
> That would be much appreciated - I'll owe you some beers next time I see
> you. But before jumping in, let's see if we can get people who have
> already worked with the code to say something.
> 
> Something I've done in the past that might be helpful - instead (or in
> addition to) having people read code in isolation, perhaps we could do a
> group call/meeting where people can ask questions about the code, bring
> up design issues they've seen in other filesystems, etc. - I've also
> found that kind of setup great for identifying places in the code where
> additional documentation would be useful.

"At this point I think Kent's QA efforts are at least as good as XFS's,
just merge it and let's move on to the next thing."

--D
Kent Overstreet July 6, 2023, 10:43 p.m. UTC | #58
On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote:
> TBH, so long as bcachefs is the only user of sixlocks and mean/variance,
> I don't really care what's in them, though they probably ought to live
> in fs/bcachefs/ until a second user arises.

I've been waiting for Linus to weigh in on those (and the rest of the
merge) since he had opinions a few weeks ago, but I have no real
objection there. I'd need to add an export for osq_lock, that's all.

> >  - d_mark_tmpfile(): trivial new helper, from pulling out part of
> >    d_tmpfile(). We need this because bcachefs handles the nlink count
> >    for tmpfiles earlier, in the btree transaction.
> 
> XFS might want this too; we also handle the nlink count for tmpfiles
> earlier, in a transaction, and end up playing stupid games with the
> refcount to fit the vfs function:
> 
> 	if (tmpfile) {
> 		/*
> 		 * The VFS requires that any inode fed to d_tmpfile must
> 		 * have nlink == 1 so that it can decrement the nlink in
> 		 * d_tmpfile.  However, we created the temp file with
> 		 * nlink == 0 because we're not allowed to put an inode
> 		 * with nlink > 0 on the unlinked list.  Therefore we
> 		 * have to set nlink to 1 so that d_tmpfile can
> 		 * immediately set it back to zero.
> 		 */
> 		set_nlink(inode, 1);
> 		d_tmpfile(tmpfile, inode);
> 	}

Yeah, that same game would technically work for bcachefs - but I'm
hoping we can just do the right thing here :)

> >    - Don't block on s_umount from __invalidate_super: this is a bugfix
> >      for a deadlock in generic/042 because of how we use sget(), the
> >      commit message goes into more detail.
> 
> If this is in reference to the earlier subthread about some io_uring
> thing causing unmount to hang -- my impressions of that were that yes,
> it's a bug, but no, it's not a bug in bcachefs itself.  I also wondered
> why (a) that hadn't been split out as its own thread; and (b) is this
> really a bcachefs blocker?

No, this is completely unrelated. The io_uring thing hits on
generic/388 (and others) and just causes umount to fail with -EBUSY.
This one is an actual deadlock in generic/042: it's specific to the
loopback device, triggered when it emits certain events, and it hits
every time, so I really do need this fix included.

> /me shrugs, been on vacation and in hospitals for the last month or so.
> 
> >      bcachefs doesn't use sget() for mutual exclusion because a) sget()
> >      is insane, what we really want is the _block device_ to be opened
> >      exclusively (which we do), and we have our own block device opening
> >      path - which we need to, as we're a multi device filesystem.
> 
> ...and isn't jan kara already messing with this anyway?

The blkdev_get_handle() patchset? I like that, but I don't think that's
related - if there's something else touching sget() I haven't seen it
yet.

> OTOH there's so many layers of tiny helper functions and macros that I
> have a really hard time making sense of all those pre-bcachefs changes.
> That's why I haven't weighed in.  Given all the weird problems we've had
> recently with new code colliding badly with under-documented optimized
> core code, I'm fearful of touching anything.

??? Not sure what you're referring to here - are there specific patches
or recent issues you're thinking of?

I don't think any of the non-fs/bcachefs/ patches are remotely tricky
enough for any of this to be a concern.

> > You, the btrfs developers, got started when Linux filesystem teams were
> > quite a bit bigger than they are now: I was at Google when Google had a
> > bunch of people working on ext4, and that was when ZFS had recently come
> > out and there was recognition that Linux needed an answer to ZFS and you
> > were able to ride that excitement. It's been a bit harder for me to get
> > something equally ambitious going, to be honest.
> > 
> > But years ago when I realized I was onto something, I decided this
> > project was only going to fail if I let it fail - so I'm in it for the
> > long haul.
> > 
> > Right now what I'm hearing, in particular from Redhat, is that they want
> > it upstream in order to commit more resources. Which, I know, is not
> > what kernel people want to hear, but it's the chicken-and-the-egg
> > situation I'm in.
> 
> /me suspects mainline merging is necessary but not sufficient -- few
> non-developers want to deal with merging an out of tree filesystem, but
> that still doesn't mean anyone will commit real engineering resources.

Yeah, no doubt it will continue to be an uphill battle. But it's a
necessary step in the right direction, for sure.

> > > I am really, really wanting you to succeed here Kent.  If the general consensus
> > > is you need to have some idiot review fs/bcachefs I will happily carve out some
> > > time and dig in.
> > 
> > That would be much appreciated - I'll owe you some beers next time I see
> > you. But before jumping in, let's see if we can get people who have
> > already worked with the code to say something.
> > 
> > Something I've done in the past that might be helpful - instead (or in
> > addition to) having people read code in isolation, perhaps we could do a
> > group call/meeting where people can ask questions about the code, bring
> > up design issues they've seen in other filesystems, etc. - I've also
> > found that kind of setup great for identifying places in the code where
> > additional documentation would be useful.
> 
> "At this point I think Kent's QA efforts are at least as good as XFS's,
> just merge it and let's move on to the next thing."

high praise :)
Theodore Ts'o July 7, 2023, 2:04 a.m. UTC | #59
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote:
> You, the btrfs developers, got started when Linux filesystem teams were
> quite a bit bigger than they are now: I was at Google when Google had a
> bunch of people working on ext4, and that was when ZFS had recently come
> out and there was recognition that Linux needed an answer to ZFS and you
> were able to ride that excitement. It's been a bit harder for me to get
> something equally ambitious going, to be honest.

Just to set the historical record straight, I think you're mixing up
two stories here.

*Btrfs* was started while I was at the IBM Linux Technology Center,
and it was because there were folks from more than one company that
were concerned that there needed to be an answer to ZFS.  IBM hosted
that meeting, but ultimately, never did contribute any developers to
the btrfs effort.  That's because IBM had a fairly cold, hard
examination of what their enterprise customers really wanted, and
would be willing to pay $$$, and the decision was made at a corporate
level (higher up than the Linux Technology Center, although I
participated in the company-wide investigation) that *none* of OS's
that IBM supported (AIX, zOS, Linux, etc.) needed ZFS-like features,
because IBM's customers didn't need them.  The vast majority of what
paying customers' workloads at the time was to run things like
Websphere, and Oracle and DB/2, and these did not need fancy
snapshots.  And things like integrity could be provided at other
layers of the storage stack.

As far as Google was concerned, yes, we had several software engineers
working on ext4, but it had nothing to do with ZFS.  We had a solid
business case for how replacing ext2 with ext4 (in nojournal mode,
since the cluster file system handled data integrity and crash
recovery) would save the company $XXX millions of dollars in storage
TCO (total cost of ownership) dollars per year.

In any case, at neither company was a "sense of excitement" something
which drove the technical decisions.  It was all about Return on
Investment (ROI).  As such, that's driven my bias towards ext4
maintenance.

I view part of my job as finding matches between interesting file
system features that I would find technically interesting, and which
would benefit the general ext4 user base, and specific business cases
that would encourage the investment of several developers on file
system technologies.

Things like case-insensitive file names, fscrypt, fsverity, etc.,
were all started *after* I had found a business case that would
interest one or more companies or divisions inside Google to put
people on the project.  Smaller projects can get funded on the
margins, sure.  But for anything big, that might require the focused
attention of one or more developers for a quarter or more, I generally
find the business case first, and often, that will inform the
requirements for the feature.  In other words, not only am I ext4's
maintainer, I'm also its product manager.

Of course, this is not the only way you can drive technology forward.
For example, at Sun Microsystems, ZFS was driven just by the techies,
and initially, they hid the fact that the project was taking place,
not asking the opinion of the finance and sales teams.  And so ZFS had
quite a lot of very innovative technologies that pushed the industry
forward, including inspiring btrfs.  Of course, Sun Microsystems
didn't do all that well financially, until they were forced to sell
themselves to the highest bidder.  So perhaps, it might be that this
particular model is one that other companies, including IBM, Red Hat,
Microsoft, Oracle, Facebook, etc., might choose to avoid emulating.

Cheers,

					- Ted
Christian Brauner July 7, 2023, 8:48 a.m. UTC | #60
> just merge it and let's move on to the next thing."

"and let the block and vfs maintainers and developers deal with the fallout"

is how that reads to others that deal with 65+ filesystems and counting.

The offlist thread that was started by Kent before this pr was sent has
seen people try to outline calmly what problems they currently still
have both maintenance wise and upstreaming wise. And it seems there's
just no way this can go over calmly but instead requires massive amounts
of defensive pushback and grandstanding.

Our main task here is to consider the concerns of people that constantly
review and rework massive amounts of generic code. And I can't in good
conscience see their concerns dismissed with snappy quotes.

I understand the impatience, I understand the excitement, I really do.
But not in this way where core people just drop off because they don't
want to deal with this anymore.

I've spent enough time on this thread.
Kent Overstreet July 7, 2023, 9:18 a.m. UTC | #61
On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote:
> > just merge it and let's move on to the next thing."
> 
> "and let the block and vfs maintainers and developers deal with the fallout"
> 
> is how that reads to others that deal with 65+ filesystems and counting.
> 
> The offlist thread that was started by Kent before this pr was sent has
> seen people try to outline calmly what problems they currently still
> have both maintenance wise and upstreaming wise. And it seems there's
> just no way this can go over calmly but instead requires massive amounts
> of defensive pushback and grandstanding.
> 
> Our main task here is to consider the concerns of people that constantly
> review and rework massive amounts of generic code. And I can't in good
> conscience see their concerns dismissed with snappy quotes.
> 
> I understand the impatience, I understand the excitement, I really do.
> But not in this way where core people just drop off because they don't
> want to deal with this anymore.
> 
> I've spent enough time on this thread.

Christian, the hostility I'm reading is in your steady passive
aggressive accusations, and your patronizing attitude. It's not
professional, and it's not called for.

Can we please try to stay focused on the code, and the process, and the
_actual_ concerns?

In that offlist thread, I don't recall much in the way of actual,
concrete concerns. I do recall Christoph doing his usual schpiel; and to
be clear, I cut short my interactions with Christoph because in nearly
15 years of kernel development he's never been anything but hostile to
anything I've posted, and the criticisms he posts tend to be vague and
unaware of the surrounding discussion, not anything actionable.

The most concrete concern from you in that offlist thread was "we don't
want a repeat of ntfs", and when I asked you to elaborate you never
responded.

Huh.

And: this pull request is not some sudden thing, I have been steadily
feeding prep work patches in and having ongoing discussions with other
filesystem people, including presenting at LSF to gather feedback, since
_well_ before you were the VFS maintainer.

If you have anything concrete to share, any concrete concerns you'd like
addressed - please share them! I'd love to work with you.

I don't want the two of us to have a hostile, adversarial relationship;
I appreciate the work you've been doing in the vfs, and I've told you
that in the past.

But it would help things if you would try to work with me, not against
me, and try to understand that there have been past discussions and
consensus built before you came along.

Cheers,
Kent
Kent Overstreet July 7, 2023, 9:35 a.m. UTC | #62
On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote:
> > just merge it and let's move on to the next thing."
> 
> "and let the block and vfs maintainers and developers deal with the fallout"
> 
> is how that reads to others that deal with 65+ filesystems and counting.
> 
> The offlist thread that was started by Kent before this pr was sent has
> seen people try to outline calmly what problems they currently still
> have both maintenance wise and upstreaming wise. And it seems there's
> just no way this can go over calmly but instead requires massive amounts
> of defensive pushback and grandstanding.
> 
> Our main task here is to consider the concerns of people that constantly
> review and rework massive amounts of generic code. And I can't in good
> conscience see their concerns dismissed with snappy quotes.
> 
> I understand the impatience, I understand the excitement, I really do.
> But not in this way where core people just drop off because they don't
> want to deal with this anymore.
> 
> I've spent enough time on this thread.

Also, if you do feel like coming back to the discussion: I would still
like to hear in more detail about your specific pain points and talk
about what we can do to address them.

I've put a _ton_ of work into test infrastructure over the years, and
it's now scalable enough to handle fstests runs on every filesystem
fstests supports - and it'll get you the results in short order.

I've started making the cluster available to other devs, and I'd be
happy to make it available to you as well. Perhaps there are other
things we could do.

Cheers,
Kent
Brian Foster July 7, 2023, 12:18 p.m. UTC | #63
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote:
> On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote:
...
> > I am really, really wanting you to succeed here Kent.  If the general consensus
> > is you need to have some idiot review fs/bcachefs I will happily carve out some
> > time and dig in.
> 
> That would be much appreciated - I'll owe you some beers next time I see
> you. But before jumping in, let's see if we can get people who have
> already worked with the code to say something.
> 

I've been poking at bcachefs for several months or so now. I'm happy to
chime in on my practical experience thus far, though I'm still not
totally clear what folks are looking for on this front, in terms of
actual review. I agree with Josef's sentiment that a thorough code
review of the entire fs is not really practical. I've not done that and
don't plan to in the short term.

As it is, I have been able to dig into various areas of the code, learn
some of the basic principles, diagnose/fix issues and get some of those
fixes merged without too much trouble. IMO, the code is fairly well
organized at a high level, reasonably well documented and
debuggable/supportable. That isn't to say some of those things couldn't
be improved (and I expect they will be), but these are more time and
resource constraints than anything and so I don't see any major red
flags in that regard. Some of my bigger personal gripes would be that a
lot of the macro code generation stuff makes it a bit harder (but not
impossible) for a novice to come up to speed, and similarly that a bit
more introductory/feature-level documentation would be useful to help
navigate areas of code without having to rely on Kent as much. The
documentation that is available is still pretty good for gaining a high
level understanding of the fs data structures, though I agree that more
content on things like on-disk format would be really nice.

Functionality-wise I think it's inevitable that there will be some
growing pains as user and developer base grows. For that reason I think
having some kind of experimental status for a period of time is probably
the right approach. Most of the issues I've dug into personally have
been corner case type things, but experience shows that these are the
sorts of things that eventually arise with more users. We've also
briefly discussed things like whether bcachefs could take more advantage
of some of the test coverage that btrfs already has in fstests, since
the feature sets should largely overlap. That is TBD, but is something
else that might be a good step towards further proving out reliability.

Related to that, something I'm not sure I've seen described anywhere is
the functional/production status of the filesystem itself (not
necessarily the development status of the various features). For
example, is the filesystem used in production at any level? If so, what
kinds of deployments, workloads and use cases do you know about? How
long have they been in use, etc.? I realize we may not have knowledge or
permission to share details, but any general info about usage in the
wild would be interesting.

The development process is fairly ad hoc, so I suspect that is something
that would have to evolve if this lands upstream. Kent, did you have
thoughts/plans around that? I don't mind contributing reviews where I
can, but that means patches would be posted somewhere for feedback, etc.
I suppose that has potential to slow things down, but also gives people
a chance to see what's happening, review or ask questions, etc., which
is another good way to learn or simply keep up with things.

All in all I pretty much agree with Josef wrt the merge request. ISTM
the main issues right now are the external dependencies and
development/community situation (i.e. bus factor). As above, I plan to
continue contributions at least in terms of fixes and whatnot so long as
$employer continues to allow me to dedicate at least some time to it and
the community is functional ;), but it's not clear to me if that is
sufficient to address the concerns here. WRT the dependencies, I agree
it makes sense to be deliberate and for anything that is contentious,
either just drop it or lift it into bcachefs for now to avoid the need
to debate on these various fronts in the first place (and simplify the
pull request as much as possible).

With those issues addressed, perhaps it would be helpful if other
interested fs maintainers/devs could chime in with any thoughts on what
they'd want to see in order to ack (but not necessarily "review") a new
filesystem pull request..? I don't have the context of the off list
thread, but from this thread ISTM that perhaps Josef and Darrick are
close to being "soft" acks provided the external dependencies are worked
out. Christoph sent a nak based on maintainer status. Kent, you can add
me as a reviewer if 1. you think that will help and 2. if you plan to
commit to some sort of more formalized development process that will
facilitate review..? I don't know if that means an ack from Christoph,
but perhaps it addresses the nak. I don't really expect anybody to
review the entire codebase, but obviously it's available for anybody who
might want to dig into certain areas in more detail..

Brian
Jan Kara July 7, 2023, 1:13 p.m. UTC | #64
On Thu 06-07-23 18:43:14, Kent Overstreet wrote:
> On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote:
> > /me shrugs, been on vacation and in hospitals for the last month or so.
> > 
> > >      bcachefs doesn't use sget() for mutual exclusion because a) sget()
> > >      is insane, what we really want is the _block device_ to be opened
> > >      exclusively (which we do), and we have our own block device opening
> > >      path - which we need to, as we're a multi device filesystem.
> > 
> > ...and isn't jan kara already messing with this anyway?
> 
> The blkdev_get_handle() patchset? I like that, but I don't think that's
> related - if there's something more related to sget() I haven't seen it
> yet

There's a series on top of that that also modifies how sget() works [1].
Christian wants that bit to be merged separately from the bdev handle stuff
and Christoph chimed in with some other related cleanups so he'll now take
care of that change.

Anyhow, we should soon have an sget() that does not exclusively claim
the bdev unless it needs to create a new superblock.

								Honza

[1] https://lore.kernel.org/all/20230704125702.23180-6-jack@suse.cz
Kent Overstreet July 7, 2023, 1:52 p.m. UTC | #65
On Fri, Jul 07, 2023 at 03:13:06PM +0200, Jan Kara wrote:
> On Thu 06-07-23 18:43:14, Kent Overstreet wrote:
> > On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote:
> > > /me shrugs, been on vacation and in hospitals for the last month or so.
> > > 
> > > >      bcachefs doesn't use sget() for mutual exclusion because a) sget()
> > > >      is insane, what we really want is the _block device_ to be opened
> > > >      exclusively (which we do), and we have our own block device opening
> > > >      path - which we need to, as we're a multi device filesystem.
> > > 
> > > ...and isn't jan kara already messing with this anyway?
> > 
> > The blkdev_get_handle() patchset? I like that, but I don't think that's
> > related - if there's something more related to sget() I haven't seen it
> > yet
> 
> There's a series on top of that that also modifies how sget() works [1].
> Christian wants that bit to be merged separately from the bdev handle stuff
> and Christoph chimed in with some other related cleanups so he'll now take
> care of that change.
> 
> Anyhow, we should soon have an sget() that does not exclusively claim
> the bdev unless it needs to create a new superblock.

Thanks for the link

sget() felt a bit odd in bcachefs because we have our own bch2_fs_open()
path that, completely separately from the VFS, opens a list of block
devices and returns a fully constructed filesystem handle.

We need this because it's also used in userspace - e.g. by fsck and
other tools - where we don't have the VFS and it wouldn't make much
sense to lift sget().

IOW, we really do need to own the whole codepath that opens the actual
block devices; our block device open path does things like parse the
opts struct to decide whether to open the block device in write mode or
exclusive mode...

So the way around this in bcachefs is we call sget() twice - first in
"find an existing sb but don't create one" mode, then if that fails we
call bch2_fs_open() and call sget() again in "create a super_block and
attach it to this bch_fs" mode - a bit awkward, but it works.

Not sure if this has come up in other filesystems, but here's the
relevant bcachefs code:
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs.c#n1756
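
In outline - names follow the linked fs.c, with error handling and the
device-list/option parsing elided, so it's a sketch rather than the
literal code:

	static struct dentry *bch2_mount(struct file_system_type *fs_type,
					 int flags, const char *dev_name,
					 void *data)
	{
		struct bch_opts opts = bch2_opts_empty();
		struct super_block *sb;
		struct bch_fs *c;

		/* first call: "find an existing sb but don't create one" -
		 * bch2_noset_super() unconditionally fails */
		sb = sget(fs_type, bch2_test_super, bch2_noset_super,
			  flags|SB_NOSEC, &opts);
		if (!IS_ERR(sb))
			goto got_sb;

		/* no existing sb: open the block devices ourselves, via
		 * the same path userspace fsck uses */
		c = bch2_fs_open(devs, nr_devs, opts);

		/* second call: unconditionally create a super_block and
		 * attach it to the bch_fs we just built */
		sb = sget(fs_type, NULL, bch2_set_super, flags|SB_NOSEC, c);
	got_sb:
		return dget(sb->s_root);
	}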
Kent Overstreet July 7, 2023, 2:49 p.m. UTC | #66
On Fri, Jul 07, 2023 at 08:18:28AM -0400, Brian Foster wrote:
> As it is, I have been able to dig into various areas of the code, learn
> some of the basic principles, diagnose/fix issues and get some of those
> fixes merged without too much trouble. IMO, the code is fairly well
> organized at a high level, reasonably well documented and
> debuggable/supportable. That isn't to say some of those things couldn't
> be improved (and I expect they will be), but these are more time and
> resource constraints than anything and so I don't see any major red
> flags in that regard. Some of my bigger personal gripes would be that a
> lot of the macro code generation stuff makes it a bit harder (but not
> impossible) for a novice to come up to speed,

Yeah, we use x-macros extensively for e.g. enums so we can also generate
string arrays. Wonderful for the to_text functions, annoying for
breaking ctags/cscope.
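
For the unfamiliar, the pattern looks roughly like this (a made-up
example, not actual bcachefs code):

	#define EXAMPLE_STATES()	\
		x(nothing)		\
		x(reading)		\
		x(writing)

	enum example_state {
	#define x(n)	EXAMPLE_STATE_##n,
		EXAMPLE_STATES()
	#undef x
	};

	/* the same list, expanded again for the string array the
	 * to_text functions use: */
	static const char * const example_state_strs[] = {
	#define x(n)	#n,
		EXAMPLE_STATES()
	#undef x
	};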

> and similarly that a bit more
> introductory/feature-level documentation would be useful to help
> navigate areas of code without having to rely on Kent as much. The
> documentation that is available is still pretty good for gaining a high
> level understanding of the fs data structures, though I agree that more
> content on things like on-disk format would be really nice.

A thought I'd been meaning to share anyways: when there's someone new
getting up to speed on the codebase, I like to use it as an opportunity
to write documentation.

If anyone who's working on the code asks for a section of code to be
documented - just tell me what you're looking at and give me an idea of
what your questions are and I'll write out a patch adding kdoc comments.
For me, this is probably the lowest stress way to get code documentation
written, and that way it's added to the code for the next person too.

In the past we've also done meetings where we looked at source code
together and I walked people through various codepaths - I think those
were effective at getting people some grounding, and especially if
there's more people interested I'd be happy to do that again.

Also, I'm pretty much always on IRC - don't hesitate to use me as a
resource!

> Functionality wise I think it's inevitable that there will be some
> growing pains as user and developer base grows. For that reason I think
> having some kind of experimental status for a period of time is probably
> the right approach. Most of the issues I've dug into personally have
> been corner case type things, but experience shows that these are the
> sorts of things that eventually arise with more users. We've also
> briefly discussed things like whether bcachefs could take more advantage
> of some of the test coverage that btrfs already has in fstests, since
> the feature sets should largely overlap. That is TBD, but is something
> else that might be a good step towards further proving out reliability.

Yep, bcachefs implements essentially the same basic user interface for
subvolumes/snapshots, and test coverage for snapshots is an area where
we're still somewhat weak.

> Related to that, something I'm not sure I've seen described anywhere is
> the functional/production status of the filesystem itself (not
> necessarily the development status of the various features). For
> example, is the filesystem used in production at any level? If so, what
> kinds of deployments, workloads and use cases do you know about? How
> long have they been in use, etc.? I realize we may not have knowledge or
> permission to share details, but any general info about usage in the
> wild would be interesting.

I don't have any hard numbers; I can only try to infer. But it has been
used in production (by paying customers) at at least a few sites for
several years; I couldn't say how many because I only find out when
something breaks :) In the wider community, it's at least hundreds,
likely thousands based on distinct users reporting bugs, the amount of
hammering on my git server since it got packaged in nixos, etc.

There's users in the IRC channel who've been running it on multiple
machines for probably 4-5 years, and generally continuously upgrading
them (I've never done an on-disk format change that required a mkfs);
I've been running it on my laptops for about that long as well.

Based on the types of bug reports I've been seeing, things have been
stabilizing quite nicely - AFAIK no one's losing data; we do have some
outstanding filesystem corruption bugs but they're little things that
fsck can easily repair and don't lead to data loss (e.g. the erasure
coding tests are complaining about disk space utilization counters being
wrong, and some of our tests are still finding the occasional backpointers
bug - Brian just started looking at that one :)

The exception is snapshots: there's a user in China who's been throwing
crazy database workloads at bcachefs, and that's still turning up some
data corruption (he sent me a filesystem metadata dump with 2k snapshots
that hit O(n^3) algorithms in fsck; fixes for that are mostly done).
Once I get back to that work and do more proper torture testing, that
should be ironed out soon - we know where the issue is now.

> The development process is fairly ad hoc, so I suspect that is something
> that would have to evolve if this lands upstream. Kent, did you have
> thoughts/plans around that? I don't mind contributing reviews where I
> can, but that means patches would be posted somewhere for feedback, etc.
> I suppose that has potential to slow things down, but also gives people
> a chance to see what's happening, review or ask questions, etc., which
> is another good way to learn or simply keep up with things.

Yeah, that's a good discussion.

I wouldn't call the current development process "ad hoc", it's the
process that's evolved to let me write code the fastest without making
users unhappy :) and that mostly revolves around good test
infrastructure, and a well structured codebase with excellent assertions
so that we can make changes with high confidence that if the tests pass
it's good.

Regarding code review: We do need to do more of that going forward, and
probably talk about what's most comfortable for people, but I'm also not
a big fan of how I see code review typically happening on the kernel
mailing lists and I want to try to push things in a different direction
for bcachefs.

In my opinion, the way we do code review tends to be on the very
fastidious/nitpicky side of things; this made sense historically
when kernel testing was _hard_ and we depended a lot more on human
review to catch errors. But the downside of that kind of code review is
it's a big time sink, and it burns people out (both the reviewers, and
the people who are trying to get reviews!) - and when the discussion is
mostly about nitpicky things, that takes away energy that could be going
into the high level "what do we want to do and what ideas do we have for
how to get there" discussions.

When we're doing code review for bcachefs, I don't want to see people
nitpicking and complaining about the style of if statements, and I
don't want people poring over every detail trying to catch bugs that our
test infrastructure will catch. Instead, save that energy for:

 - identifying things that are legitimately tricky or have a high
   probability of introducing errors that won't be caught by tests:
   that's something we do want to talk about, that's being proactive

 - looking at the code coverage analysis to see where we're missing
   tests (we have automated code coverage analysis now!)

 - making sure changes are sane and _understandable_

 - and just keeping abreast of each other's work. We don't need to get
   every detail, just the gist so we can understand each other's goals.

The interactions in engineering teams that I've found to be the most
valuable have never been code review; they're the more abstract
discussions that happen _after_ we all understand what we're trying to
do. That's what I want to see more of.

Now, getting back to the "how are we going to do code review" discussion -
I personally prefer to do code review over IRC with a link to the author's
git repository; I find a conversational format and quick feedback to be very
valuable (I do not want people blocked because they're waiting on code
review).

But the mailing list sees a wider audience, so I see no reason why we
can't start sending all our patches to the linux-bcachefs mailing list
as well.

Regarding the "more abstract, what are we trying to do" discussions: I'm
currently hosting a bcachefs cabal meeting every other week, and I might
bump it up to once a week soon - email me if you'd like an invite, the
wider fs community is definitely meant to be included.

I've also experimented in the past with an open voice chat/conference
call (hosted via the matrix channel); most of us aren't in office
environments anymore, but the shared office with people working on
similar things was great for quickly getting people up to speed, and the
voice chat seemed to work well for that - I'm inclined to start doing
that again.

> All in all I pretty much agree with Josef wrt the merge request. ISTM
> the main issues right now are the external dependencies and
> development/community situation (i.e. bus factor). As above, I plan to
> continue contributions at least in terms of fixes and whatnot so long as
> $employer continues to allow me to dedicate at least some time to it and
> the community is functional ;), but it's not clear to me if that is
> sufficient to address the concerns here. WRT the dependencies, I agree
> it makes sense to be deliberate and for anything that is contentious,
> either just drop it or lift it into bcachefs for now to avoid the need
> to debate on these various fronts in the first place (and simplify the
> pull request as much as possible).

I'd hoped we could table the discussion on "dependencies" in the abstract.
Prior consensus, from multiple occasions when I was feeding in bcachefs
prep work, was that patches that were _only_ needed for bcachefs should
be part of the bcachefs pull request - that's what I've been sticking
to.

Slimming down the dependencies any further will require non-ideal
engineering tradeoffs, so any request/suggestion to do so needs to come
with some specifics. And Jens already ok'd the 4 block patches, which
were the most significant. 

> With those issues addressed, perhaps it would be helpful if other
> interested fs maintainers/devs could chime in with any thoughts on what
> they'd want to see in order to ack (but not necessarily "review") a new
> filesystem pull request..? I don't have the context of the off list
> thread, but from this thread ISTM that perhaps Josef and Darrick are
> close to being "soft" acks provided the external dependencies are worked
> out. Christoph sent a nak based on maintainer status. Kent, you can add
> me as a reviewer if 1. you think that will help and 2. you plan to
> commit to some sort of more formalized development process that will
> facilitate review..?

That sounds agreeable :)
James Bottomley July 7, 2023, 4:26 p.m. UTC | #67
On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote:
> > > just merge it and let's move on to the next thing."
> > 
> > "and let the block and vfs maintainers and developers deal with the
> > fallout"
> > 
> > is how that reads to others that deal with 65+ filesystems and
> > counting.
> > 
> > The offlist thread that was started by Kent before this pr was sent
> > has seen people try to outline calmly what problems they currently
> > still have both maintenance wise and upstreaming wise. And it seems
> > there's just no way this can go over calmly but instead requires
> > massive amounts of defensive pushback and grandstanding.
> > 
> > Our main task here is to consider the concerns of people that
> > constantly review and rework massive amounts of generic code. And I
> > can't in good conscience see their concerns dismissed with snappy
> > quotes.
> > 
> > I understand the impatience, I understand the excitement, I really
> > do. But not in this way where core people just drop off because
> > they don't want to deal with this anymore.
> > 
> > I've spent enough time on this thread.
> 
> Christian, the hostility I'm reading is in your steady passive
> aggressive accusations, and your patronizing attitude. It's not
> professional, and it's not called for.

Can you not see that saying this is a huge red flag?  With you every
disagreement becomes, as Josef said, "a hill to die on" and you then
feel entitled to indulge in ad hominem attacks, like this, or be
dismissive or try to bury whoever raised the objection in technical
minutiae in the hope you can demonstrate you have a better grasp of the
details than they do and therefore their observation shouldn't count.

One of a maintainer's jobs is to nurture and build a community and
that's especially important at the inclusion of a new feature.  What
we've seen from you implies you'd build a community of little Kents
(basically an echo chamber of people who agree with you) and use them
as a platform to attack any area of the kernel you didn't agree with
technically (which, apparently, would be most of block and vfs with a
bit of mm thrown in), leading to huge divisions and infighting.  Anyone
who had the slightest disagreement with you would be out and would
likely behave in the same way you do now leading to internal community
schisms and more fighting on the lists.

We've spent years trying to improve the lists and make the community
welcoming.  However technically brilliant a new feature is, it can't
come with this sort of potential for community and reputational damage.

> Can we please try to stay focused on the code, and the process, and
> the _actual_ concerns?
> 
> In that offlist thread, I don't recall much in the way of actual,
> concrete concerns. I do recall Christoph doing his usual schpiel; and
> to be clear, I cut short my interactions with Christoph because in
> nearly 15 years of kernel development he's never been anything but
> hostile to anything I've posted, and the criticisms he posts tend to
> be vague and unaware of the surrounding discussion, not anything
> actionable.

This too is a red flag.  Working with difficult people is one of a
maintainer's jobs as well.  Christoph has done an enormous amount of
highly productive work over the years.  Sure, he's prickly and sure
there have been fights, but everyone except you seems to manage to
patch things up and accept his contributions.  If it were just one
personal problem it might be overlookable, but you seem to be having
major fights with the maintainer of every subsystem you touch...

James
Kent Overstreet July 7, 2023, 4:48 p.m. UTC | #68
On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> > On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote:
> > > > just merge it and let's move on to the next thing."
> > > 
> > > "and let the block and vfs maintainers and developers deal with the
> > > fallout"
> > > 
> > > is how that reads to others that deal with 65+ filesystems and
> > > counting.
> > > 
> > > The offlist thread that was started by Kent before this pr was sent
> > > has seen people try to outline calmly what problems they currently
> > > still have both maintenance wise and upstreaming wise. And it seems
> > > there's just no way this can go over calmly but instead requires
> > > massive amounts of defensive pushback and grandstanding.
> > > 
> > > Our main task here is to consider the concerns of people that
> > > constantly review and rework massive amounts of generic code. And I
> > > can't in good conscience see their concerns dismissed with snappy
> > > quotes.
> > > 
> > > I understand the impatience, I understand the excitement, I really
> > > do. But not in this way where core people just drop off because
> > > they don't want to deal with this anymore.
> > > 
> > > I've spent enough time on this thread.
> > 
> > Christian, the hostility I'm reading is in your steady passive
> > aggressive accusations, and your patronizing attitude. It's not
> > professional, and it's not called for.
> 
> Can you not see that saying this is a huge red flag?  With you every
> disagreement becomes, as Josef said, "a hill to die on" and you then
> feel entitled to indulge in ad hominem attacks, like this, or be
> dismissive or try to bury whoever raised the objection in technical
> minutiae in the hope you can demonstrate you have a better grasp of the
> details than they do and therefore their observation shouldn't count.
> 
> One of a maintainer's jobs is to nurture and build a community and
> that's especially important at the inclusion of a new feature.  What
> we've seen from you implies you'd build a community of little Kents
> (basically an echo chamber of people who agree with you) and use them
> as a platform to attack any area of the kernel you didn't agree with
> technically (which, apparently, would be most of block and vfs with a
> bit of mm thrown in), leading to huge divisions and infighting.  Anyone
> who had the slightest disagreement with you would be out and would
> likely behave in the same way you do now leading to internal community
> schisms and more fighting on the lists.
> 
> We've spent years trying to improve the lists and make the community
> welcoming.  However technically brilliant a new feature is, it can't
> come with this sort of potential for community and reputational damage.
> 
> > Can we please try to stay focused on the code, and the process, and
> > the _actual_ concerns?
> > 
> > In that offlist thread, I don't recall much in the way of actual,
> > concrete concerns. I do recall Christoph doing his usual schpiel; and
> > to be clear, I cut short my interactions with Christoph because in
> > nearly 15 years of kernel development he's never been anything but
> > hostile to anything I've posted, and the criticisms he posts tend to
> > be vague and unaware of the surrounding discussion, not anything
> > actionable.
> 
> This too is a red flag.  Working with difficult people is one of a
> maintainer's jobs as well.  Christoph has done an enormous amount of
> highly productive work over the years.  Sure, he's prickly and sure
> there have been fights, but everyone except you seems to manage to
> patch things up and accept his contributions.  If it were just one
> personal problem it might be overlookable, but you seem to be having
> major fights with the maintainer of every subsystem you touch...

James, I will bend over backwards to work with people who will work to
continue the technical discussion.

That's what I'm here to do.

I'm going to bow out of this line of discussion on the thread. Feel free
to continue privately if you like.
James Bottomley July 7, 2023, 5:04 p.m. UTC | #69
On Fri, 2023-07-07 at 12:48 -0400, Kent Overstreet wrote:
> On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
[...]
> > > In that offlist thread, I don't recall much in the way of actual,
> > > concrete concerns. I do recall Christoph doing his usual schpiel;
> > > and to be clear, I cut short my interactions with Christoph
> > > because in nearly 15 years of kernel development he's never been
> > > anything but hostile to anything I've posted, and the criticisms
> > > he posts tend to be vague and unaware of the surrounding
> > > discussion, not anything actionable.
> > 
> > This too is a red flag.  Working with difficult people is one of a
> > maintainer's jobs as well.  Christoph has done an enormous amount
> > of highly productive work over the years.  Sure, he's prickly and
> > sure there have been fights, but everyone except you seems to
> > manage to patch things up and accept his contributions.  If it were
> > just one personal problem it might be overlookable, but you seem to
> > be having major fights with the maintainer of every subsystem you
> > touch...
> 
> James, I will bend over backwards to work with people who will work
> to continue the technical discussion.

You will?  Because that doesn't seem to align with your statement about
Christoph being "vague and unaware of the surrounding discussions" and
not posting "anything actionable" for the last 15 years.  No-one else
has that impression and we've almost all had run-ins with Christoph at
some point.

James
Kent Overstreet July 7, 2023, 5:26 p.m. UTC | #70
On Fri, Jul 07, 2023 at 01:04:14PM -0400, James Bottomley wrote:
> On Fri, 2023-07-07 at 12:48 -0400, Kent Overstreet wrote:
> > On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> > > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> [...]
> > > > In that offlist thread, I don't recall much in the way of actual,
> > > > concrete concerns. I do recall Christoph doing his usual schpiel;
> > > > and to be clear, I cut short my interactions with Christoph
> > > > because in nearly 15 years of kernel development he's never been
> > > > anything but hostile to anything I've posted, and the criticisms
> > > > he posts tend to be vague and unaware of the surrounding
> > > > discussion, not anything actionable.
> > > 
> > > This too is a red flag.  Working with difficult people is one of a
> > > maintainer's jobs as well.  Christoph has done an enormous amount
> > > of highly productive work over the years.  Sure, he's prickly and
> > > sure there have been fights, but everyone except you seems to
> > > manage to patch things up and accept his contributions.  If it were
> > > just one personal problem it might be overlookable, but you seem to
> > > be having major fights with the maintainer of every subsystem you
> > > touch...
> > 
> > James, I will bend over backwards to work with people who will work
> > to continue the technical discussion.
> 
> You will?  Because that doesn't seem to align with your statement about
> Christoph being "vague and unaware of the surrounding discussions" and
> not posting "anything actionable" for the last 15 years.  No-one else
> has that impression and we've almost all had run-ins with Christoph at
> some point.

If I'm going to respond to this I'd have to start citing interactions
and I don't want to dig things that deep in public.

Can we either try to resolve this privately or drop it?
Matthew Wilcox July 8, 2023, 3:54 a.m. UTC | #71
On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> > Christian, the hostility I'm reading is in your steady passive
> > aggressive accusations, and your patronizing attitude. It's not
> > professional, and it's not called for.
> 
> Can you not see that saying this is a huge red flag?  With you every
> disagreement becomes, as Josef said, "a hill to die on" and you then
> feel entitled to indulge in ad hominem attacks, like this, or be
> dismissive or try to bury whoever raised the objection in technical
> minutiae in the hope you can demonstrate you have a better grasp of the
> details than they do and therefore their observation shouldn't count.
> 
> One of a maintainer's jobs is to nurture and build a community and
> that's especially important at the inclusion of a new feature.  What
> we've seen from you implies you'd build a community of little Kents
> (basically an echo chamber of people who agree with you) and use them
> as a platform to attack any area of the kernel you didn't agree with
> technically (which, apparently, would be most of block and vfs with a
> bit of mm thrown in), leading to huge divisions and infighting.  Anyone
> who had the slightest disagreement with you would be out and would
> likely behave in the same way you do now leading to internal community
> schisms and more fighting on the lists.
> 
> We've spent years trying to improve the lists and make the community
> welcoming.  However technically brilliant a new feature is, it can't
> come with this sort of potential for community and reputational damage.

I don't think the lists are any better, tbh.  Yes, the LF has done a great
job of telling people not to use "bad words" any more.  But people are
still arseholes to each other.  They're just more subtle about it now.
I'm not going to enumerate the ways because that's pointless.

Consider this thread from Kent's point of view.  He's worked for years
on bcachefs.  Now he's asking "What needs to happen to get this merged?"
And instead of getting a clear answer as to the technical pieces that
need to get fixed, various people are taking the opportunity to tell him
he's a Bad Person.  And when he reacts to that, this is taken as more
evidence that he's a Bad Person, rather than being a person who is in
a stressful situation (Limbo?  Purgatory?) who is perhaps not reacting
in the most constructive way.

I don't think Kent is particularly worse as a fellow developer than you
or I or Jens, Greg, Al, Darrick, Dave, Dave, Dave, Dave, Josef or Brian.
There are some social things which are a concern to me.  There's no
obvious #2 or #3 to step in if Kent does get hit by the proverbial bus,
but that's been discussed elsewhere in the thread.

Anyway, I'm in favour of bcachefs inclusion.  I think the remaining
problems can be worked out post-merge.  I don't see Kent doing a
drop-and-run on the codebase.  Maintaining this much code outside the
main kernel tree is hard.  One thing I particularly like about btrfs
compared to ntfs3 is that it doesn't use old legacy code like the buffer
heads, which means that it doesn't add to the technical debt.  From the
page cache point of view, it's fairly clean.  I wish it used iomap, but
iomap would need quite a lot of new features to accommodate everything
bcachefs wants to do.  Maybe iomap will grow those features over time.
Kent Overstreet July 8, 2023, 4:10 a.m. UTC | #72
On Sat, Jul 08, 2023 at 04:54:22AM +0100, Matthew Wilcox wrote:
> One thing I particularly like about btrfs

:) 

> compared to ntfs3 is that it doesn't use old legacy code like the buffer
> heads, which means that it doesn't add to the technical debt.  From the
> page cache point of view, it's fairly clean.  I wish it used iomap, but
> iomap would need quite a lot of new features to accommodate everything
> bcachefs wants to do.  Maybe iomap will grow those features over time.

My big complaint with iomap is that it's still the old callback based
approach - an indirect function call into the filesystem to get a
mapping, then Doing Stuff, for every walk.

Instead of calling back and forth, we could be filling out a data
structure to represent the IO, then handing it off to the filesystem to
look up the mappings and send to the right place, splitting as needed.
Best part is, we already have such a data structure: struct bio. That's
the approach bcachefs takes.
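
Roughly, the difference in shape - as a toy sketch; none of these
names are the real iomap or bcachefs interfaces:

        /* callback style: an indirect call into the fs per mapping
         * as the generic code walks the range */
        while (len) {
                struct mapping m = ops->get_mapping(inode, pos);
                unsigned n = min(len, m.len);

                do_io(&m, n);
                pos += n;
                len -= n;
        }

        /* bio style: describe the whole IO up front, hand it off once;
         * the fs looks up mappings and splits the bio as needed */
        bio = bio_alloc(bdev, nr_pages, REQ_OP_READ, GFP_KERNEL);
        /* add pages, set bi_iter.bi_sector, etc. */
        fs_submit_io(bio);      /* hypothetical fs entry point */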

It would be nice sharing the page cache management code, but like you
mentioned, iomap would have to grow a bunch of features. But, some of
those features other users might like: in particular bcachefs hangs disk
reservations and dirty sector (for i_blocks accounting) off the
pagecache, which to me is a total no brainer, it eliminates looking up
in a second data structure for e.g. the buffered write path.
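
In sketch form, the idea is just this (made-up names - the real code
tracks state per sector within the folio):

        /* hung off folio->private, so the buffered write path, which
         * already has the folio locked, gets at it for free: */
        struct folio_res {
                unsigned        dirty_sectors;          /* -> i_blocks */
                unsigned        reserved_sectors;       /* disk reservation */
        };

        static inline struct folio_res *folio_res(struct folio *folio)
        {
                return folio->private;
        }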

Also worth noting - bcachefs has had large folio support for a while now :)
Kent Overstreet July 8, 2023, 4:31 a.m. UTC | #73
On Sat, Jul 08, 2023 at 04:54:22AM +0100, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> > > Christian, the hostility I'm reading is in your steady passive
> > > aggressive accusations, and your patronizing attitude. It's not
> > > professional, and it's not called for.
> > 
> > Can you not see that saying this is a huge red flag?  With you every
> > disagreement becomes, as Josef said, "a hill to die on" and you then
> > feel entitled to indulge in ad hominem attacks, like this, or be
> > dismissive or try to bury whoever raised the objection in technical
> > minutiae in the hope you can demonstrate you have a better grasp of the
> > details than they do and therefore their observation shouldn't count.
> > 
> > One of a maintainer's jobs is to nurture and build a community and
> > that's especially important at the inclusion of a new feature.  What
> > we've seen from you implies you'd build a community of little Kents
> > (basically an echo chamber of people who agree with you) and use them
> > as a platform to attack any area of the kernel you didn't agree with
> > technically (which, apparently, would be most of block and vfs with a
> > bit of mm thrown in), leading to huge divisions and infighting.  Anyone
> > who had the slightest disagreement with you would be out and would
> > likely behave in the same way you do now leading to internal community
> > schisms and more fighting on the lists.
> > 
> > We've spent years trying to improve the lists and make the community
> > welcoming.  However technically brilliant a new feature is, it can't
> > come with this sort of potential for community and reputational damage.
> 
> I don't think the lists are any better, tbh.  Yes, the LF has done a great
> job of telling people not to use "bad words" any more.  But people are
> still arseholes to each other.  They're just more subtle about it now.
> I'm not going to enumerate the ways because that's pointless.

I've long thought a more useful CoC would start with "always try to
continue the technical conversation in good faith, always try to build
off of what other people are saying; don't shut people down".

The work we do has real consequences. There are consequences for the
people doing the work, and consequences for the people that use our work
if we screw things up. Things are bound to get heated at times; that's
expected, and it's ok - as long as we can remember to keep doing the
work and pushing forward.
Theodore Ts'o July 8, 2023, 3:02 p.m. UTC | #74
On Sat, Jul 08, 2023 at 12:31:36AM -0400, Kent Overstreet wrote:
> 
> I've long thought a more useful CoC would start with "always try to
> continue the technical conversation in good faith, always try to build
> off of what other people are saying; don't shut people down".

Kent, with all due respect, you do not always follow the formulation
you've suggested above.  That is to say, you do not always assume that
your conversational partner is trying to raise objections in good
faith.  You also seem to assume that you are the smartest person in
the room, and that if they object, they are Obviously Wrong.

As a result, it's not pleasant to have a technical conversation with
you.  And as others have said, when someone like Christian Brauner
decides that it's too frustrating to continue with the thread - given
my observations of his past interactions with a wide variety of
people, including some folks traditionally regarded as "difficult to
work with" - it's a real red flag.

Regards,

					- Ted
Kent Overstreet July 8, 2023, 3:23 p.m. UTC | #75
On Sat, Jul 08, 2023 at 11:02:49AM -0400, Theodore Ts'o wrote:
> On Sat, Jul 08, 2023 at 12:31:36AM -0400, Kent Overstreet wrote:
> > 
> > I've long thought a more useful CoC would start with "always try to
> > continue the technical conversation in good faith, always try to build
> > off of what other people are saying; don't shut people down".
> 
> Kent, with all due respect, you do not always follow the formulation
> you've suggested above.  That is to say, you do not always assume that
> your conversational partner is trying to raise objections in good
> faith. 

Ted, how do you have a technical conversation with someone who refuses
to say anything concrete, even when you ask them to elaborate on their
objections, and instead just repeats the same vague non-answers?

> You also seem to assume that you are the smartest person in the room,
> and that if they object, they are Obviously Wrong.

Ok, now you're really reaching.

Anyone who's actually worked with me can tell you I am quick to consider
other people's points of view and quick to admit when I'm wrong.

All I ask is the same courtesy.
James Bottomley July 8, 2023, 4:42 p.m. UTC | #76
On Sat, 2023-07-08 at 04:54 +0100, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote:
> > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote:
> > > Christian, the hostility I'm reading is in your steady passive
> > > aggressive accusations, and your patronizing attitude. It's not
> > > professional, and it's not called for.
> > 
> > Can you not see that saying this is a huge red flag?  With you
> > every disagreement becomes, as Josef said, "a hill to die on" and
> > you then feel entitled to indulge in ad hominem attacks, like this,
> > or be dismissive or try to bury whoever raised the objection in
> > technical minutiae in the hope you can demonstrate you have a
> > better grasp of the details than they do and therefore their
> > observation shouldn't count.
> > 
> > One of a maintainer's jobs is to nurture and build a community and
> > that's especially important at the inclusion of a new feature. 
> > What we've seen from you implies you'd build a community of little
> > Kents (basically an echo chamber of people who agree with you) and
> > use them as a platform to attack any area of the kernel you didn't
> > agree with technically (which, apparently, would be most of block
> > and vfs with a bit of mm thrown in), leading to huge divisions and
> > infighting.  Anyone who had the slightest disagreement with you
> > would be out and would likely behave in the same way you do now
> > leading to internal community schisms and more fighting on the
> > lists.
> > 
> > We've spent years trying to improve the lists and make the
> > community welcoming.  However technically brilliant a new feature
> > is, it can't come with this sort of potential for community and
> > reputational damage.
> 
> I don't think the lists are any better, tbh.  Yes, the LF has done a
> great job of telling people not to use "bad words" any more.  But
> people are still arseholes to each other.

I don't think the LF has done much actively on the lists ... we've been
trying to self police.

>   They're just more subtle about it now. I'm not going to enumerate
> the ways because that's pointless.

Well, we can agree to differ since this isn't relevant to the main
argument.

> Consider this thread from Kent's point of view.  He's worked for
> years on bcachefs.  Now he's asking "What needs to happen to get this
> merged?" And instead of getting a clear answer as to the technical
> pieces that need to get fixed, various people are taking the
> opportunity to tell him he's a Bad Person.  And when he reacts to
> that, this is taken as more evidence that he's a Bad Person, rather
> than being a person who is in a stressful situation (Limbo? 
> Purgatory?) who is perhaps not reacting in the most constructive way.

That's a bit of a straw man: I never said or implied "bad person".  I
gave two examples, one from direct list interaction and one quoted from
Kent of what I consider to be red flags behaviours on behalf of a
maintainer.

> I don't think Kent is particularly worse as a fellow developer than
> you or I or Jens, Greg, Al, Darrick, Dave, Dave, Dave, Dave, Josef or
> Brian.

I don't believe any of us have been unable to work with a fairly
prolific contributor for 15 years ...

>  There are some social things which are a concern to me.  There's no
> obvious #2 or #3 to step in if Kent does get hit by the proverbial
> bus, but that's been discussed elsewhere in the thread.

Actually, I don't think this is a problem: a new feature has no users
and having no users, it doesn't matter if it loses its only maintainer
because it can be excised without anyone really noticing.  The bus
problem (or more accurately xkcd 2347 problem) commonly applies to a
project with a lot of users but an anaemic developer community, which
is a state a project can grow into but doesn't start in ab initio.  The
ordinary course for a kernel feature is single developer; hobby project
(small community of users as developers); and eventually a non
technical user community.  Usually the hobby project phase grows enough
interested developers to ensure a fairly healthy developer community by
the time it actually acquires non developer users (and quite a few of
our features never actually get out of the hobby project phase).

James
Kent Overstreet July 9, 2023, 1:16 a.m. UTC | #77
On Sat, Jul 08, 2023 at 12:42:59PM -0400, James Bottomley wrote:
> That's a bit of a straw man: I never said or implied "bad person".  I
> gave two examples, one from direct list interaction and one quoted from
> Kent of what I consider to be red flags behaviours on behalf of a
> maintainer.

You responded with a massive straw man about an army of little
Kents - seriously, what the hell was that about?

The only maintainers that I've had ongoing problems with have been Jens
and Christoph, and there's more history to that than I want to get into.

If you're talking about _our_ disagreement, I was arguing that cut and
pasting code from other repositories is a terrible workflow that's going
to cause us problems down the road, especially for the Rust folks, and
then afterwards you started hounding me in unrelated LKML discussions.

So clearly you took that personally, and I think maybe you still are.
Kent Overstreet July 12, 2023, 2:54 a.m. UTC | #78
So: looks like we missed the merge window. Boo :)

Summing up discussions from today's cabal meeting, other off list
discussions, and this thread:

 - bcachefs is now marked EXPERIMENTAL

 - Brian Foster will be listed as a reviewer

 - Josef's stepping up to do some code review, focusing on vfs-interacty
   bits. I'm hoping to do at least some of this in a format where Josef
   peppers me with questions and we turn that into new code
   documentation, so others can directly benefit: if anyone has an area
   they work on and would like to see documented in bcachefs, we'll take
   a look at that too.

 - Prereq patch series has been pruned down a bit more; also Mike
   Snitzer suggested putting those patches in their own branch:

   https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs

   "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced
   with willy's "iov_iter: Handle compound highmem pages in
   copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4
   since it's technically a bug fix; in the meantime, it'll be getting
   more testing from my users.

   The two lockdep patches have been dropped for now; the
   bcachefs-for-upstream branch is switched back to
   lockdep_set_novalidate_class() for btree node locks. 

   six locks, mean and variance have been moved into fs/bcachefs/ for
   now; this means there's a new prereq patch to export
   osq_(lock|unlock)

   The remaining prereq patches are pretty trivial, with the exception
   of "block: Don't block on s_umount from __invalidate_super()". I
   would like to get a reviewed-by for that patch, and it wouldn't hurt
   for others.

   previous posting:
   https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f

 - Code review was talked about a bit earlier in the thread: for the
   moment I'm just posting big stuff, but I'd like to aim for making
   sure all patches (including mine) hit the linux-bcachefs mailing list
   in the future:

   https://lore.kernel.org/linux-bcachefs/20230709171551.2349961-1-kent.overstreet@linux.dev/T/

 - We also talked quite a bit about the QA process. I'm going to work on
   finally publishing ktest/ktestci, which is my test infrastructure
   that myself and a few other people are using - I'd like to see it
   used more widely.

   For now, here's the test dashboard for the bcachefs-for-upstream
   branch:
   https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream

 - Also: not directly related to upstreaming, but relevant for the
   community: we talked about getting together a meeting with some of
   the btrfs people to gather design input, ideas, and lessons learned.

   If anyone would be interested in working on and improving the multi
   device capabilities of bcachefs in particular, this would be a great
   time to get involved. That stuff is in good shape and seeing a lot of
   active use - it's one of bcachefs's major drawing points - and I want
   it to be even better.

And here's the branch I intend to re-submit next merge window, as it
currently sits:
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-upstream

Please chime in if I forgot anything important... :)

Cheers,
Kent
Kees Cook July 12, 2023, 7:48 p.m. UTC | #79
On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote:
>  - Prereq patch series has been pruned down a bit more; also Mike
>    Snitzer suggested putting those patches in their own branch:
> 
>    https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs
> 
>    "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced
>    with willy's "iov_iter: Handle compound highmem pages in
>    copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4
>    since it's technically a bug fix; in the meantime, it'll be getting
>    more testing from my users.
> 
>    The two lockdep patches have been dropped for now; the
>    bcachefs-for-upstream branch is switched back to
>    lockdep_set_novalidate_class() for btree node locks. 
> 
>    six locks, mean and variance have been moved into fs/bcachefs/ for
>    now; this means there's a new prereq patch to export
>    osq_(lock|unlock)
> 
>    The remaining prereq patches are pretty trivial, with the exception
>    of "block: Don't block on s_umount from __invalidate_super()". I
>    would like to get a reviewed-by for that patch, and it wouldn't hurt
>    for others.
> 
>    previous posting:
>    https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f

Can you send these prereqs out again, with maintainers CCed
appropriately? (I think some feedback from the prior revision needs to
be addressed first, though. For example, __flatten already exists, etc.)
Kent Overstreet July 12, 2023, 7:57 p.m. UTC | #80
On Wed, Jul 12, 2023 at 12:48:31PM -0700, Kees Cook wrote:
> On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote:
> >  - Prereq patch series has been pruned down a bit more; also Mike
> >    Snitzer suggested putting those patches in their own branch:
> > 
> >    https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs
> > 
> >    "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced
> >    with willy's "iov_iter: Handle compound highmem pages in
> >    copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4
> >    since it's technically a bug fix; in the meantime, it'll be getting
> >    more testing from my users.
> > 
> >    The two lockdep patches have been dropped for now; the
> >    bcachefs-for-upstream branch is switched back to
> >    lockdep_set_novalidate_class() for btree node locks. 
> > 
> >    six locks, mean and variance have been moved into fs/bcachefs/ for
> >    now; this means there's a new prereq patch to export
> >    osq_(lock|unlock)
> > 
> >    The remaining prereq patches are pretty trivial, with the exception
> >    of "block: Don't block on s_umount from __invalidate_super()". I
> >    would like to get a reviewed-by for that patch, and it wouldn't hurt
> >    for others.
> > 
> >    previous posting:
> >    https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f
> 
> Can you send these prereqs out again, with maintainers CCed
> appropriately? (I think some feedback from the prior revision needs to
> be addressed first, though. For example, __flatten already exists, etc.)

Thanks for pointing that out, I knew it was in the pipeline :)

Will do...
Darrick J. Wong July 12, 2023, 10:10 p.m. UTC | #81
On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote:
> So: looks like we missed the merge window. Boo :)
> 
> Summing up discussions from today's cabal meeting, other off list
> discussions, and this thread:
> 
>  - bcachefs is now marked EXPERIMENTAL
> 
>  - Brian Foster will be listed as a reviewer

/me applauds!

>  - Josef's stepping up to do some code review, focusing on vfs-interacty
>    bits. I'm hoping to do at least some of this in a format where Josef
>    peppers me with questions and we turn that into new code
>    documentation, so others can directly benefit: if anyone has an area
>    they work on and would like to see documented in bcachefs, we'll take
>    a look at that too.
> 
>  - Prereq patch series has been pruned down a bit more; also Mike
>    Snitzer suggested putting those patches in their own branch:
> 
>    https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs
> 
>    "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced
>    with willy's "iov_iter: Handle compound highmem pages in
>    copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4
>    since it's technically a bug fix; in the meantime, it'll be getting
>    more testing from my users.
> 
>    The two lockdep patches have been dropped for now; the
>    bcachefs-for-upstream branch is switched back to
>    lockdep_set_novalidate_class() for btree node locks. 
> 
>    six locks, mean and variance have been moved into fs/bcachefs/ for
>    now; this means there's a new prereq patch to export
>    osq_(lock|unlock)
> 
>    The remaining prereq patches are pretty trivial, with the exception
>    of "block: Don't block on s_umount from __invalidate_super()". I
>    would like to get a reviewed-by for that patch, and it wouldn't hurt
>    for others.
> 
> >    previous posting:
>    https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f
> 
>  - Code review was talked about a bit earlier in the thread: for the
>    moment I'm just posting big stuff, but I'd like to aim for making
>    sure all patches (including mine) hit the linux-bcachefs mailing list
>    in the future:
> 
>    https://lore.kernel.org/linux-bcachefs/20230709171551.2349961-1-kent.overstreet@linux.dev/T/
> 
>  - We also talked quite a bit about the QA process. I'm going to work on
>    finally publishing ktest/ktestci, which is my test infrastructure
>    that myself and a few other people are using - I'd like to see it
>    used more widely.
> 
>    For now, here's the test dashboard for the bcachefs-for-upstream
>    branch:
>    https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream
> 
>  - Also: not directly related to upstreaming, but relevant for the
>    community: we talked about getting together a meeting with some of
>    the btrfs people to gather design input, ideas, and lessons learned.

Please invite me too! :)

Granted XFS doesn't do multi-device support (for large values of
'multi') but now that I've spent 6 years of my life concentrating on
repairability for XFS, I might have a few things to say about bcachefs.

That is if I can shake off the torrent of syzbot crap long enough to
read anything in bcachefs.git. :(

--D

>    If anyone would be interested in working on and improving the multi
>    device capabilities of bcachefs in particular, this would be a great
>    time to get involved. That stuff is in good shape and seeing a lot of
>    active use - it's one of bcachefs's major drawing points - and I want
>    it to be even better.
> 
> And here's the branch I intend to re-submit next merge window, as it
> currently sits:
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-upstream
> 
> Please chime in if I forgot anything important... :)
> 
> Cheers,
> Kent
Kent Overstreet July 12, 2023, 11:57 p.m. UTC | #82
On Wed, Jul 12, 2023 at 03:10:12PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote:
> >  - Also: not directly related to upstreaming, but relevant for the
> >    community: we talked about getting together a meeting with some of
> >    the btrfs people to gather design input, ideas, and lessons learned.
> 
> Please invite me too! :)
> 
> Granted XFS doesn't do multi-device support (for large values of
> 'multi') but now that I've spent 6 years of my life concentrating on
> repairability for XFS, I might have a few things to say about bcachefs.

Absolutely!

Maybe we could start brainstorming ideas to cover now, on the list? I
honestly know XFS so little (I've read code here and there, but I don't
know much about the high level structure) that I wouldn't know where to
start.

Filesystems are such a huge world of "oh, that would've made my life so
much easier if I'd had that idea at the right time..."

> That is if I can shake off the torrent of syzbot crap long enough to
> read anything in bcachefs.git. :(

:(
Linus Torvalds Aug. 9, 2023, 1:27 a.m. UTC | #83
[ *Finally* getting back to this, I wanted to start reviewing the
changes immediately after the merge window, but something else always
kept coming up .. ]

On Tue, 11 Jul 2023 at 19:55, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> So: looks like we missed the merge window. Boo :)

Well, looking at the latest 'bcachefs-for-upstream' state I see, I'm
happy to see the pre-requisites outside bcachefs being much smaller.

The six locks are now contained within bcachefs, and I like what I see
more now that it doesn't play games with 'u64' and lots of bitfields.

I'm still not actually convinced the locks *work* correctly, but I'm
not seeing huge red flags. I do suspect there are memory ordering
issues in there that would all be hidden on x86, and some of it looks
strange, but not necessarily unfixable.

Example of oddity:

                barrier();
                w->lock_acquired = true;

which really smells like it should be

                smp_store_release(&w->lock_acquired, true);

(and the reader side in six_lock_slowpath() should be a
smp_load_acquire()) because otherwise the preceding __list_del()
writes would seem to possibly by re-ordered by the CPU to past the
lock_acquired write, causing all kinds of problems.

On x86, you'd never see that as an issue, since all writes are
releases, so the 'barrier()' compiler ordering ends up forcing the
right magic.
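
I.e. something like this - sketching against the snippet above, not
the actual six lock code:

        /* unlock side: take the waiter off the list on its behalf,
         * then publish the handoff: */
        __list_del(w->list.prev, w->list.next);
        smp_store_release(&w->lock_acquired, true);

        /* six_lock_slowpath(), waiter side: */
        while (!smp_load_acquire(&w->lock_acquired))
                cpu_relax();
        /* everything written before the store_release is visible here */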

Some of the other oddity is around the this_cpu ops, but I suspect
that is at least partly then because we don't have acquire/release
versions of the local cpu ops that the code looks like it would want.

I did *not* look at any of the internal bcachefs code itself (apart
from the locking, obviously). I'm not that much of a low-level
filesystem person (outside of the vfs itself), so I just don't care
deeply. I care that it's maintained and that people who *are*
filesystem people are at least not hugely against it.

That said, I do think that the prerequisites should go in first and
independently, and through maintainers.

And there clearly is something very strange going on with superblock
handling and the whole crazy discussion about fput being delayed. It
is what it is, and the patches I saw in this thread to not delay them
were bad.

As to the actual prereqs:

I'm not sure why 'd_mark_tmpfile()' didn't do the d_instantiate() that
everybody seems to want, but it looks fine to me. Maybe just because
Kent wanted the "mark" semantics for the naming. Fine.

The bio stuff should preferably go through Jens, or at least at a
minimum be acked.

The '.faults_disabled_mapping' thing is a bit odd, but I don't hate
it, and I could imagine that other filesystems could possibly use that
approach instead of the current 'pagefault_disable/enable' games and
->nofault games to avoid the whole "use mmap to have the source and
the destination of a write be the same page" thing.

So as things stand now, the stuff outside bcachefs itself I don't find
objectionable.

The stuff _inside_ bcachefs I care about only in the sense that I
really *really* would like a locking person to look at the six locks,
but at the same time as long as it's purely internal to bcachefs and
doesn't possibly affect anything else, I'm not *too* worried about
what I see.

The thing that actually bothers me most about this all is the personal
arguments I saw.  That I don't know what to do about. I don't actually
want to merge this over the objections of Christian, now that we have
a responsible vfs maintainer.

So those kinds of arguments do kind of have to be resolved, even aside
from the "I think the prerequisites should go in separately or at
least be clearly acked" issues.

Sorry for the delay, I really did want to get these comments out
directly after the merge window closed, but this just ended up always
being the "next thing"..

                  Linus
Kent Overstreet Aug. 10, 2023, 3:54 p.m. UTC | #84
Adding Jens to the CC:

On Tue, Aug 08, 2023 at 06:27:29PM -0700, Linus Torvalds wrote:
> [ *Finally* getting back to this, I wanted to start reviewing the
> changes immediately after the merge window, but something else always
> kept coming up .. ]
> 
> On Tue, 11 Jul 2023 at 19:55, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > So: looks like we missed the merge window. Boo :)
> 
> Well, looking at the latest 'bcachefs-for-upstream' state I see, I'm
> happy to see the pre-requisites outside bcachefs being much smaller.
> 
> The six locks are now contained within bcachefs, and I like what I see
> more now that it doesn't play games with 'u64' and lots of bitfields.

Heh, I liked the bitfields - I prefer that to open coding structs, which
is a major pet peeve of mine. But the text size went down a lot
without them (I'd like to know why the compiler couldn't constant fold
all that stuff out, but... not enough to bother).

Anyways...

> I'm still not actually convinced the locks *work* correctly, but I'm
> not seeing huge red flags. I do suspect there are memory ordering
> issues in there that would all be hidden on x86, and some of it looks
> strange, but not necessarily unfixable.
> 
> Example of oddity:
> 
>                 barrier();
>                 w->lock_acquired = true;
> 
> which really smells like it should be
> 
>                 smp_store_release(&w->lock_acquired, true);
> 
> (and the reader side in six_lock_slowpath() should be a
> smp_load_acquire()) because otherwise the preceding __list_del()
> writes would seem to possibly be re-ordered by the CPU to past the
> lock_acquired write, causing all kinds of problems.
> 
> On x86, you'd never see that as an issue, since all writes are
> releases, so the 'barrier()' compiler ordering ends up forcing the
> right magic.

Yep, agreed.

Also, there's a mildly interesting optimization here: the thread doing
the unlock is taking the lock on behalf of the thread waiting for the
lock, and signalling via the waitlist entry: this means the thread
waiting for the lock doesn't have to touch the cacheline the lock is on
at all. IOW, a better version of the handoff that rwsem/mutex do.

Been meaning to experiment with dropping osq_lock and instead just
adding to the waitlist and spinning on w->lock_acquired; this should
actually simplify the code and be another small optimization (less
bouncing of the lock cacheline).
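
Roughly (untested sketch - the wait_lock/wait_list names are just
guesses at what the code would end up looking like):

        struct six_lock_waiter w = { .task = current };

        raw_spin_lock(&lock->wait_lock);
        list_add_tail(&w.list, &lock->wait_list);
        raw_spin_unlock(&lock->wait_lock);

        /* no osq_lock: spin on our own waitlist entry, which stays
         * cache-local until the unlock path hands us the lock: */
        while (!smp_load_acquire(&w.lock_acquired))
                cpu_relax();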

> Some of the other oddity is around the this_cpu ops, but I suspect
> that is at least partly then because we don't have acquire/release
> versions of the local cpu ops that the code looks like it would want.

You mean using full barriers where acquire/release would be sufficient?

> I did *not* look at any of the internal bcachefs code itself (apart
> from the locking, obviously). I'm not that much of a low-level
> filesystem person (outside of the vfs itself), so I just don't care
> deeply. I care that it's maintained and that people who *are*
> filesystem people are at least not hugely against it.
> 
> That said, I do think that the prerequisites should go in first and
> independently, and through maintainers.

Matthew was planning on sending the iov_iter patch to you - right around
now, I believe, as a bugfix, since right now
copy_page_from_iter_atomic() silently does crazy things if you pass it a
compound page.

Block layer patches aside, are there any _others_ you really want to go
via maintainers? Because the consensus in the past when I was feeding in
prereqs for bcachefs was that patches just for bcachefs should go with
the bcachefs pull request.

> And there clearly is something very strange going on with superblock
> handling

This deserves an explanation because sget() is a bit nutty.

The way sget() is conventionally used for block device filesystems, the
block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
but the holder is the fs type pointer, so it won't exclude with other
opens of the same fs type.

That means the only protection from multiple opens scribbling over each
other is sget() itself - but if the bdev handle ever outlives the
superblock we're completely screwed; that's a silent data corruption bug
that we can't easily catch, and if the filesystem teardown path has any
asynchronous stuff going on (and of course it does) that's not a hard
mistake to make. I've observed at least one bug that looked suspiciously
like that, but I don't think I quite pinned it down at the time.

It also forces the caller to separate opening of the block devices from
the rest of filesystem initialization, which is a bit less than ideal.

Anyways, bcachefs just wants to be able to do real exclusive opens of
the block devices, and we do all filesystem bringup with a single
bch2_fs_open() call. I think this could be made to work with the way
sget() wants to work, but it'd require reworking the locking in
sget() - it does everything, including the test() and set() calls, under
a single spinlock.
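
For reference, the conventional flow - this is roughly what
mount_bdev() does today:

        /* holder is the fs type, so a second mount of the same fs type
         * doesn't conflict at the open: */
        bdev = blkdev_get_by_path(dev_name,
                                  FMODE_READ|FMODE_WRITE|FMODE_EXCL,
                                  fs_type);

        /* only sget()'s test() callback, run under sb_lock, keeps two
         * superblocks from being created over the same bdev: */
        s = sget(fs_type, test_bdev_super, set_bdev_super,
                 flags | SB_NOSEC, bdev);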

> and the whole crazy discussion about fput being delayed. It
> is what it is, and the patches I saw in this thread to not delay them
> were bad.

Jens claimed AIO was broken in the same way as io_uring, but it turned
out that it's not - the test he posted was broken.

And io_uring really is broken here. Look, the tests that are breaking
because of this are important ones (generic/388 in particular), and
those tests are no good to us if they're failing because of io_uring
crap and Jens is throwing up his hands and saying "trainwreck!" when we
try to get it fixed.

> As to the actual prereqs:
> 
> I'm not sure why 'd_mark_tmpfile()' didn't do the d_instantiate() that
> everybody seems to want, but it looks fine to me. Maybe just because
> Kent wanted the "mark" semantics for the naming. Fine.

Originally, we were doing d_instantiate() separately, in common code,
and the d_mark_tmpfile() was separate. Looking over the code again that
would still be a reasonable approach, so I'd keep it that way.

> The bio stuff should preferably go through Jens, or at least at a
> minimum be acked.

So, the block layer patches have been out on the list and been
discussed, and they got an "OK" from Jens -
https://lore.kernel.org/linux-fsdevel/aeb2690c-4f0a-003d-ba8b-fe06cd4142d1@kernel.dk/

But that's a little ambiguous - Jens, what do you want to do with those
patches? I can re-send them to you now if you want to take them through
your tree, or an ack would be great.

> The '.faults_disabled_mapping' thing is a bit odd, but I don't hate
> it, and I could imagine that other filesystems could possibly use that
> approach instead of the current 'pagefault_disable/enable' games and
> ->nofault games to avoid the whole "use mmap to have the source and
> the destination of a write be the same page" thing.
> 
> So as things stand now, the stuff outside bcachefs itself I don't find
> objectionable.
> 
> The stuff _inside_ bcachefs I care about only in the sense that I
> really *really* would like a locking person to look at the six locks,
> but at the same time as long as it's purely internal to bcachefs and
> doesn't possibly affect anything else, I'm not *too* worried about
> what I see.
> 
> The thing that actually bothers me most about this all is the personal
> arguments I saw.  That I don't know what to do about. I don't actually
> want to merge this over the objections of Christian, now that we have
> a responsible vfs maintainer.

I don't want to do that to Christian either, I think highly of the work
he's been doing and I don't want to be adding to his frustration. So I
apologize for losing my cool earlier; a lot of that was frustration
from other threads spilling over.

But: if he's going to be raising objections, I need to know what his
concerns are if we're going to get anywhere. Raising objections without
saying what the concerns are shuts down discussion; I don't think it's
unreasonable to ask people not to do that, and to try and stay focused
on the code.

He's got an open invite to the bcachefs meeting, and we were scheduled
to talk Tuesday but he was out sick - anyways, I'm looking forward to
hearing what he has to say.

More broadly, it would make me really happy if we could get certain
people to take a more constructive, "what do we really care about here
and how do we move forward" attitude instead of turning every
interaction into an opportunity to dig their heels in on process and
throw up barriers.

That burns people out, fast. And it's getting to be a problem in
-fsdevel land; I've lost count of the times I've heard Eric Sandeen
complain about how impossible it is to get things merged, and I _really_
hope people are taking notice about Darrick stepping away from XFS and
asking themselves what needs to be sorted out. Darrick writes
meticulous, well documented code; when I think of people who slip in
hacks other people are going to regret later, he's not one of them. And
yet, online fsck for XFS has been pushed back repeatedly because of
petty bullshit.

Scaling laws being what they are, that's a feature we're going to need,
and more importantly XFS cannot afford to lose more people - especially
Darrick.

To speak a bit to what's been driving _me_ a bit nuts in these
discussions, top of my list is that the guy who's been the most
obstinate and argumentative _to this day_ refuses to CC me when touching
code I wrote - and as a result we've had some really nasty bugs (memory
corruption, _silent data corruption_).

So that really needs to change. Let's just please have a little more
focus on not eating people's data, and being more responsible about
bugs.

Anyways, I just want to write the best code I can. That's all I care
about, and I'm always happy to interact with people who share that goal.

Cheers,
Kent
Linus Torvalds Aug. 10, 2023, 4:40 p.m. UTC | #85
On Thu, 10 Aug 2023 at 08:55, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Heh, I liked the bitfields - I prefer that to open coding structs, which
> is a major pet peeve of mine. But the text size went down a lot
> without them (I'd like to know why the compiler couldn't constant fold
> all that stuff out, but... not enough to bother).

Bitfields are actually absolutely *horrible* for many, many reasons.
The bit ordering being undefined is just one of them.

Yes, they are convenient syntax, but the downsides of them means that
you should basically only use them for code that has absolutely zero
subtle issues.

Avoid them like the plague with any kind of "data transfer issues", so
in the kernel avoid using them for user interfaces unless you are
willing to deal with the duplication and horror of
__LITTLE_ENDIAN_BITFIELD etc.
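
That duplication looks like this - a minimal version of what e.g.
struct iphdr has to do:

        struct example {
        #if defined(__LITTLE_ENDIAN_BITFIELD)
                __u8    low:4,
                        high:4;
        #elif defined(__BIG_ENDIAN_BITFIELD)
                __u8    high:4,
                        low:4;
        #else
        #error  "Please fix <asm/byteorder.h>"
        #endif
        };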

Also avoid them if there is any chance of "raw format" issues: either
saving binary data formats, or - as in your original code - using
unions.

As I pointed out your code was actively buggy, because you thought it
was little-endian. That's not even true on little-endian machines (ie
POWERPC is big-endian in bitfields, even when being little-endian in
bytes!).

Finally, as you found out, it also easily generates horrid code. It's
just _harder_ for compilers to do the right thing, particularly when
it's not obvious that other parts of the structure may be "don't care"
because they got initialized earlier (or will be initialized later).
Together with threading requirements, compilers might do a bad job
either because of the complexity, or simply because of subtle
consistency rules.

End result: by all means use bitfields for the *simple* cases where
they are used purely for internal C code with no form of external
exposure, but be aware that even then the syntax convenience easily
comes at a cost.

> > On x86, you'd never see that as an issue, since all writes are
> > releases, so the 'barrier()' compiler ordering ends up forcing the
> > right magic.
>
> Yep, agreed.

But you should realize that on other architectures, I think that
"barrier() + plain write" is actively buggy. On x86 it's safe, but on
arm (and in fact pretty much anything but s390), the barrier() does
nothing in SMP. Yes, it constrains the compiler, but the earlier
writes to remove the entry from the list may happen *later* as far as
other CPUs are concerned.

Which can be a huge problem if the "struct six_lock_waiter" is on the
stack - which I assume it is - and the waiter is just spinning on
w->lock_acquired. The data structure may be re-used as regular stack
space by the time the list removal code happens.

Debugging things like that is a nightmare. And you'll never see it on
x86, and it doesn't look possible when looking at the code, and the
oopses on other architectures will be completely random stack
corruption some time after it got the lock.

So this is kind of why I worry about locking. It's really easy to
write code that works 99.9% of the time, but then breaks when you are
unlucky. And depending on the pattern, the "when you are unlucky" may
or may not be possible on x86. It's not like x86 has total memory
ordering either, it's just stronger than most.

> > Some of the other oddity is around the this_cpu ops, but I suspect
> > that is at least partly then because we don't have acquire/release
> > versions of the local cpu ops that the code looks like it would want.
>
> You mean using full barriers where acquire/release would be sufficient?

Yes.

That code looks like it should work, but be hugely less efficient than
it might be. "smp_mb()" tends to be expensive everywhere, even x86.

Of course, I might be missing some other cases. That percpu reader
queue worries me a bit just because it ends up generating ordering
based on two different things - the lock word _and_ the percpu word.

And I get very nervous if the final "this gets the lock" isn't some
obvious "try_cmpxchg_acquire()" or similar, just because we've
historically had *so* many very subtle bugs in just about every single
lock we've ever had.

> Matthew was planning on sending the iov_iter patch to you - right around
> now, I believe, as a bugfix, since right now
> copy_page_from_iter_atomic() silently does crazy things if you pass it a
> compound page.
>
> Block layer patches aside, are there any _others_ you really want to go
> via maintainers?

It was mainly just the iov and the block layer.

The superblock cases I really don't understand why you insist on just
being different from everybody else.

Your exclusivity arguments make no sense to me. Just open the damn
thing. No other filesystem has ever had the fundamental problems you
describe. You can do any exclusivity test you want in the
"test()/set()" functions passed to sget().

You say that it's a problem because of a "single spinlock", but it
hasn't been a problem for anybody else.

I don't understand why you are so special. The whole problem seems made-up.

> More broadly, it would make me really happy if we could get certain
> people to take a more constructive, "what do we really care about here
> and how do we move forward" attitude instead of turning every
> interaction into an opportunity to dig their heels in on process and
> throw up barriers.

Honestly, I think one huge problem here is that you've been working on
this for a long time (what - a decade by now?) and you've made all
these decisions that you explicitly wanted to be done independently
and intentionally outside the kernel.

And then you feel that "now it's ready to be included", and think that
all those decisions you made outside of the mainline kernel now *have*
to be done that way, and basically sent your first pull request as a
fait-accompli.

The six-locks showed some of that, but as long as they are
bcachefs-internal, I don't much care.

The sget() thing really just smells like "this is how I designed
things, and that's it".

                Linus
Jan Kara Aug. 10, 2023, 5:52 p.m. UTC | #86
On Thu 10-08-23 11:54:53, Kent Overstreet wrote:
> > And there clearly is something very strange going on with superblock
> > handling
> 
> This deserves an explanation because sget() is a bit nutty.
> 
> The way sget() is conventionally used for block device filesystems, the
> block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
> but the holder is the fs type pointer, so it won't exclude with other
> opens of the same fs type.
> 
> That means the only protection from multiple opens scribbling over each
> other is sget() itself - but if the bdev handle ever outlives the
> superblock we're completely screwed; that's a silent data corruption bug
> that we can't easily catch, and if the filesystem teardown path has any
> asynchronous stuff going on (and of course it does) that's not a hard
> mistake to make. I've observed at least one bug that looked suspiciously
> like that, but I don't think I quite pinned it down at the time.

This is just being changed - check Christian's VFS tree. There are patches
that make sget() use superblock pointer as a bdev holder so the reuse
you're speaking about isn't a problem anymore.

> It also forces the caller to separate opening of the block devices from
> the rest of filesystem initialization, which is a bit less than ideal.
> 
> Anyways, bcachefs just wants to be able to do real exclusive opens of
> the block devices, and we do all filesystem bringup with a single
> bch2_fs_open() call. I think this could be made to work with the way
> sget() wants to work, but it'd require reworking the locking in
> sget() - it does everything, including the test() and set() calls, under
> a single spinlock.

Yeah. Maybe the current upstream changes aren't enough to make your life
easier for bcachefs - btrfs does its special thing as well, after all,
because mount also involves multiple devices for it. I just wanted to
mention that
the exclusive bdev open thing is changing.

								Honza
Kent Overstreet Aug. 10, 2023, 6:02 p.m. UTC | #87
On Thu, Aug 10, 2023 at 09:40:08AM -0700, Linus Torvalds wrote:
> > > Some of the other oddity is around the this_cpu ops, but I suspect
> > > that is at least partly then because we don't have acquire/release
> > > versions of the local cpu ops that the code looks like it would want.
> >
> > You mean using full barriers where acquire/release would be sufficient?
> 
> Yes.
> 
> That code looks like it should work, but be hugely less efficient than
> it might be. "smp_mb()" tends to be expensive everywhere, even x86.

do_six_unlock_type() doesn't need a full barrier, but I'm not sure we
can avoid the one in __do_six_trylock(), in the percpu reader path.

> Of course, I might be missing some other cases. That percpu reader
> queue worries me a bit just because it ends up generating ordering
> based on two different things - the lock word _and_ the percpu word.
> 
> And I get very nervous if the final "this gets the lock" isn't some
> obvious "try_cmpxchg_acquire()" or similar, just because we've
> historically had *so* many very subtle bugs in just about every single
> lock we've ever had.

kernel/locking/percpu-rwsem.c uses the same idea. The difference is that
percpu-rwsem avoids the memory barrier on the read side in the fast path
at the cost of requiring an rcu barrier on the write side... and all the
crazyness that entails.

But __percpu_down_read_trylock() uses the same algorithm I'm using,
including the same smp_mb(): we need to ensure that the read of the lock
state happens after the store to the percpu read count, and I don't know
how to do that without a smp_mb() - smp_store_acquire() isn't a thing.
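
The read-side pattern, roughly - same shape as
__percpu_down_read_trylock(), with made-up field names:

        this_cpu_inc(*lock->readers);           /* store to percpu count */
        smp_mb();                               /* order store before load */

        if (!READ_ONCE(lock->writer_held))      /* read of the lock state */
                return true;                    /* fast path: got the lock */

        this_cpu_dec(*lock->readers);           /* back out, go slow path */
        return false;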

> > Matthew was planning on sending the iov_iter patch to you - right around
> > now, I believe, as a bugfix, since right now
> > copy_page_from_iter_atomic() silently does crazy things if you pass it a
> > compound page.
> >
> > Block layer patches aside, are there any _others_ you really want to go
> > via maintainers?
> 
> It was mainly just the iov and the block layer.
> 
> The superblock cases I really don't understand why you insist on just
> being different from everybody else.
> 
> Your exclusivity arguments make no sense to me. Just open the damn
> thing. No other filesystem has ever had the fundamental problems you
> describe. You can do any exclusivity test you want in the
> "test()/set()" functions passed to sget().

When using sget() in the conventional way it's not possible for
FMODE_EXCL to protect against concurrent opens scribbling over each
other because we open the block device before checking if it's already
mounted, and we expect that open to succeed.

> You say that it's a problem because of a "single spinlock", but it
> hasn't been a problem for anybody else.

The spinlock means you can't do the actual open in set(), which is why
the block device has to be opened in not-really-exclusive mode.

I think it'd be possible to change the locking in sget() so that the
set() callback could do the open, but I haven't looked closely at it.

> and basically sent your first pull request as a fait accompli.

When did I ever do that?
Linus Torvalds Aug. 10, 2023, 6:09 p.m. UTC | #88
On Thu, 10 Aug 2023 at 11:02, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> When using sget() in the conventional way it's not possible for
> FMODE_EXCL to protect against concurrent opens scribbling over each
> other because we open the block device before checking if it's already
> mounted, and we expect that open to succeed.

So? Read-only operations. Don't write to anything until after you
have verified your exclusive status.

If you think you need to be exclusive to other people opening the
device for other things, just stop expecting to control the whole
world.

                Linus
Darrick J. Wong Aug. 10, 2023, 10:39 p.m. UTC | #89
On Thu, Aug 10, 2023 at 11:54:53AM -0400, Kent Overstreet wrote:
> Adding Jens to the CC:

<snip to the parts I care most about>

> > and the whole crazy discussion about fput being delayed. It
> > is what it is, and the patches I saw in this thread to not delay them
> > were bad.
> 
> Jens claimed AIO was broken in the same way as io_uring, but it turned
> out that it's not - the test he posted was broken.
> 
> And io_uring really is broken here. Look, the tests that are breaking
> because of this are important ones (generic/388 in particular), and
> those tests are no good to us if they're failing because of io_uring
> crap and Jens is throwing up his hands and saying "trainwreck!" when we
> try to get it fixed.

FWIW I recently fixed all my stupid debian package dependencies so that
I could actually install liburing again, and rebuilt fstests.  The very
next morning I noticed a number of new test failures in /exactly/ the
way that Kent said to expect:

fsstress -d /mnt & <sleep then simulate fs crash>; \
	umount /mnt; mount /dev/sda /mnt

Here, umount exits before the filesystem is really torn down, and then
mount fails because it can't get an exclusive lock on the device.  As a
result, I can't test crash recovery or corrupted metadata shutdowns
because of this delayed fput thing or whatever.  It all worked before
I turned on io_uring (even with libaio in use).

Obviously, I "fixed" this by modifying fsstress to require explicit
enabling of io_uring operations; everything went back to green after
that.

I'm not familiar enough with the kernel side of io_uring to know what
the solution here is; I'm merely here to provide a second data point.

<snip again>

> > The thing that actually bothers me most about this all is the personal
> > arguments I saw.  That I don't know what to do about. I don't actually
> > want to merge this over the objections of Christian, now that we have
> > a responsible vfs maintainer.
> 
> I don't want to do that to Christian either, I think highly of the work
> he's been doing and I don't want to be adding to his frustration. So I
> apologize for loosing my cool earlier; a lot of that was frustration
> from other threads spilling over.
> 
> But: if he's going to be raising objections, I need to know what his
> concerns are if we're going to get anywhere. Raising objections without
> saying what the concerns are shuts down discussion; I don't think it's
> unreasonable to ask people not to do that, and to try and stay focused
> on the code.

Yeah, I'm also really happy that we have a new/second VFS maintainer.  I
figure it's going to take us a while to help Christian to get past his
fear and horror at the things lurking in fs/ but that's something worth
doing.

(I'm not presuming to know what Christian feels about the VFS; 'fear and
horror' is what *I* feel every time I have to go digging down there.
I'm extrapolating about what I would need, were I a new maintainer, to
get myself to the point where I would have an open enough mind to engage
with new or unfamiliar concepts so that a review cycle for something as
big as bcachefs/online fsck/whatever would be productive.)

> He's got an open invite to the bcachefs meeting, and we were scheduled
> to talk Tuesday but he was out sick - anyways, I'm looking forward to
> hearing what he has to say.
>
> More broadly, it would make me really happy if we could get certain
> people to take a more constructive, "what do we really care about here
> and how do we move forward" attitude

...and "what are all the supporting structures that we need to have in
place to maximize the chances that we'll accomplish those goals"?

> instead of turning every
> interaction into an opportunity to dig their heels in on process and
> throw up barriers.
>
> That burns people out, fast. And it's getting to be a problem in
> -fsdevel land;

Past-participle, not present. :/

I've said this previously, and I'll say it again: we're severely
under-resourced.  Not just XFS, the whole fsdevel community.  As a
developer and later a maintainer, I've learnt the hard way that a
very large amount of non-coding work is necessary to build a good
filesystem.  There's enough not-really-coding work for several people.
Instead, we lean hard on maintainers to do all that work.  That might've
worked acceptably for the first 20 years, but it doesn't now.

Nowadays we have all these people running bots and AIs throwing a steady
stream of bug reports and CVE reports at Dave [Chinner] and me.  Most of
these people *do not* help fix the problems they report.  Once in a
while there's an actual *user* report about data loss, but those
(thankfully) aren't the majority of the reports.

However, every one of these reports has to be triaged, analyzed, and
dealt with.  As soon as we clear one, at least one more rolls in.  You
know what that means?  Dave and I are both in a permanent state of
heightened alert, fear, and stress.  We never get to settle back down to
calm.  Every time someone brings up syzbot, CVEs, or security?  I feel
my own stress response ramping up.  I can no longer have "rational"
conversations about syzbot because those discussions push my buttons.

This is not healthy!

Add to that the many demands to backport this and that to dozens of LTS
kernels and distro kernels.  Why do the participation modes for that
seem to be (a) take on an immense amount of backporting work that you
didn't ask for; or (b) let a non-public ML thing pick patches and get
yelled at when it does the wrong thing?  Nobody ever asked me if I
thought the XFS community could support such-and-such LTS kernel.

As the final insult, other people pile on by offering useless opinions
about the maintainers being far behind and unhelpful suggestions that we
engage in a major codebase rewrite.  None of this is helpful.

Dave and I are both burned out.  I'm not sure Dave ever got past the
2017 burnout that led to his resignation.  Remarkably, he's still
around.  Is this (extended burnout) where I want to be in 2024?  2030?
Hell no.

I still have enough left that I want to help us adapt our culture
to solve these problems.  I tried to get the conversation started with
the maintainer entry profile for XFS that I recently submitted, but that
alone cannot be the final product:
https://lore.kernel.org/linux-xfs/169116629797.3243794.7024231508559123519.stgit@frogsfrogsfrogs/T/#m74bac05414cfba214f5cfa24a0b1e940135e0bed

Being a maintainer feels like a punishment, and that cannot stand.
We need help.

People see the kinds of interpersonal interactions going on here and
decide to pursue any other career path.  I know so; some have told me
themselves.

You know what's really sad?  Most of my friends work for small
companies, nonprofits, and local governments.  They report the same
problems with overwork, pervasive fear and anger, and struggle to
understand and adapt to new ideas that I observe here.  They see the
direct connection between their org's lack of revenue and the
under-resourcing.

They /don't/ understand why the hell the same happens to me and my
workplace proximity associates, when we all work for companies that
each clear hundreds of billions of dollars in revenue per year.

(Well, they do understand: GREED.  They don't get why we put up with
this situation, or why we don't advocate louder for making things
better.)

> I've lost count of the times I've heard Eric Sandeen
> complain about how impossible it is to get things merged,

A group dynamic that I keep observing around here is that someone tries
to introduce some unfamiliar (or even slightly new) concept, because
they want the kernel to do something it didn't do before.  The author
sends out patches for review, and some of the reviewers who show up
sound like they're so afraid of ... something ... that they throw out
vague arguments that something might break.

[I have had people tell me in private that while they don't have any
specific complaints about online fsck, "something" is wrong and I need
to stop and consider more thoroughly.  Consider /what/?]

Or, worse, no reviewers show up.  The author merges it, and a month
later there's a freakout because something somewhere else broke.  Angry
threads spread around fsdevel because now there's pressure to get it
fixed before -rc8 (in the good case) or ASAP (because now it's
released).  Did the author have an incomplete understanding of the code?
Were there potential reviewers who might've said something but bailed?
Yes and yes.

What do we need to reduce the amount of fear and anger around here,
anyway?  20 years ago when I started my career in Linux I found the work
to be challenging and enjoyable.  Now I see a lot more anger, and I am
sad, because there /are/ still enjoyable challenges to be undertaken.
Can we please have that conversation?

People and groups do not do well when they feel like they're under
constant attack, like they have to brace themselves for whatever
bullshit is coming next.  That is how I feel most weeks, and I choose
not to do that anymore.

> and I _really_
> hope people are taking notice about Darrick stepping away from XFS and
> asking themselves what needs to be sorted out.

Me too.  Ted expressed similar laments about ext4 after I announced my
intention to reduce my own commitments to XFS.

> Darrick writes
> meticulous, well documented code; when I think of people who slip by
> hacks other people are going to regret later, he's not one of them.

I appreciate the compliment. ;)

From what I can tell (because I lolquit and finally had time to start
scanning the bcachefs code) I really like the thought that you've put
into indexing and record iteration in the filesystem.  I appreciate the
amount of work you've put into making it easy and fast to run QA on
bcachefs, even if we don't quite agree on whether or not I should rip
and replace my 20yo Debian crazyquilt.

> And yet, online fsck for XFS has been pushed back repeatedly because
> of petty bullshit.

A broader dynamic here is that I ask people to review the code so that I
can merge it; they say they will do it; and then an entire cycle goes by
without any visible progress.

When I ask these people why they didn't follow through on their
commitments, the responses I hear are pretty uniform -- they got buried
in root cause analysis of a real bug report but lol there were no other
senior people available; their time ended up being spent on backports or
arguing about backports; or they got caught up in that whole freakout
thing I described above.

> Scaling laws being what they are, that's a feature we're going to need,
> and more importantly XFS cannot afford to lose more people - especially
> Darrick.

While I was maintainer I lobbied managers at Oracle and Google and RH to
hire new people to grow the size of the XFS community, and they did.
That was awesome!  It's not so hard to help managers come up with
business justifications for headcount for critical pieces of their
products*.

But.

For 2023 XFS is already down 2 people + whatever the hell I was doing
that isn't "trying to get online fsck merged".  We're still at +1, but
still: who's going to replace us oldtimers?

--D

* But f*** impossible to get that done when it's someone's 20% project
  causing a lot of friction on the mailing lists.

> To speak a bit to what's been driving _me_ a bit nuts in these
> discussions, top of my list is that the guy who's been the most
> obstinate and argumentative _to this day_ refuses to CC me when touching
> code I wrote - and as a result we've had some really nasty bugs (memory
> corruption, _silent data corruption_).
> 
> So that really needs to change. Let's just please have a little more
> focus on not eating people's data, and being more responsible about
> bugs.
> 
> Anyways, I just want to write the best code I can. That's all I care
> about, and I'm always happy to interact with people who share that goal.
> 
> Cheers,
> Kent
Matthew Wilcox Aug. 10, 2023, 11:07 p.m. UTC | #90
On Thu, Aug 10, 2023 at 11:54:53AM -0400, Kent Overstreet wrote:
> Matthew was planning on sending the iov_iter patch to you - right around
> now, I believe, as a bugfix, since right now
> copy_page_from_iter_atomic() silently does crazy things if you pass it a
> compound page.

That's currently sitting in Darrick's iomap tree, commit 908a1ad89466
"iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()"

It's based on 6.5-rc3, so it would be entirely possible for Darrick
to send Linus a pull request for 908a1ad89466 ... or you could pull
in 908a1ad89466.  I'll talk to Darrick tomorrow.
Linus Torvalds Aug. 10, 2023, 11:47 p.m. UTC | #91
On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@kernel.org> wrote:
>
> FWIW I recently fixed all my stupid debian package dependencies so that
> I could actually install liburing again, and rebuilt fstests.  The very
> next morning I noticed a number of new test failures in /exactly/ the
> way that Kent said to expect:
>
> fsstress -d /mnt & <sleep then simulate fs crash>; \
>         umount /mnt; mount /dev/sda /mnt
>
> Here, umount exits before the filesystem is really torn down, and then
> mount fails because it can't get an exclusive lock on the device.

I agree that that obviously sounds like mount is just returning either
too early. Or too eagerly.

But I suspect any delayed fput() issues (whether from aio or io_uring)
are then just a way to trigger the problem, not the fundamental cause.

Because even if the fput() is delayed, the mntput() part of that
delayed __fput action is the one that *should* have kept the
filesystem mounted until it is no longer busy.

And more importantly, having some of the common paths synchronize
*their* fput() calls only affects those paths.

It doesn't affect the fundamental issue that the last fput() can
happen in odd contexts when the file descriptor was used for something
a bit stranger.

So I do feel like the fput patch I saw looked more like a "hide the
problem" than a real fix.

Put another way: I would not be surprised in the *least* if then
adding more synchronization to fput would basically hide any issue,
particularly from tests that then use those things that you added
synchronization for.

But it really smells like it's purely hiding the symptom to me.

If I were a betting man, I'd look at ->mnt_count.  I'm not saying
that's the problem, but the mnt refcount handling is more than a bit
scary.

It is so hugely performance-critical (ie every single path access)
that we use those percpu counters for it, and I'm not at all sure it's
all race-free.

Just as an example, mnt_add_count() has this comment above it:

 * vfsmount lock must be held for read

but afaik the main way it gets called is through mntget(), and I see
no vfsmount lock held anywhere there (think "path_get()" and friends).
Maybe I'm missing something obvious.

So I think that comment is some historical left-over that hasn't been
true for ages.

And all of the counter updates should be consistent even in the
absence of said lock, so it's not an issue.

Except when it is: it does look like it *would* screw up
mnt_get_count() that tries to add up all those percpu counters with

        for_each_possible_cpu(cpu) {
                count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
        }

and that one has that

 * vfsmount lock must be held for write

comment, which makes sense as a "that would indeed synchronize if
others held it for read". But...

And where is that sum used? Very much in things like may_umount_tree().
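
To make the shape of that concrete, here's a minimal sketch of the
pattern - simplified from what fs/namespace.c actually does, with
names changed to make clear it's illustrative rather than the real
code:

static inline void mnt_add_count_sketch(struct mount *mnt, int n)
{
	/* lockless per-CPU update - cheap enough for every path access */
	this_cpu_add(mnt->mnt_pcp->mnt_count, n);
}

static int mnt_get_count_sketch(struct mount *mnt)
{
	int count = 0;
	int cpu;

	/* only a stable sum if all the writers above are excluded,
	 * e.g. by holding the vfsmount lock for write - otherwise
	 * another CPU can be mid-update while we read its slot */
	for_each_possible_cpu(cpu)
		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
	return count;
}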

Anyway, I'm absolutely not saying this is the actual problem - we
probably at least partly just have stale or incomplete comments, and
maybe I think the fput() side is good mainly because I'm *much* more
familiar with that side than I am with the actual mount code these
days.

So I might be barking up entirely the wrong tree.

But I do feel like the fput patch I saw looked more like a "hide the
problem" than a real fix. Because the mount counting *should* be
entirely independent of when exactly a fput happens.

So I believe having the test-case then do some common fputs
synchronously pretty much by definition can't fix any issues, but it
*can* make sure that any normal test using just regular system calls
then never triggers the "oh, in other situations the last fput will be
delayed".

So that's why I'm very leery of the fput patch I saw. I don't think it
makes sense.

That does *not* mean that I don't believe that umount can have serious problems.

I suspect we get very little coverage of that in normal situations.

And yes, obviously io_uring does add *way* more asynchronicity, and
I'd not be surprised at all if it uncovers problems.

In most other situations, the main sources of file counts are purely
open/close system calls, which are in many ways "simple" (and where
things like process exit obviously does the closing part).

                Linus
Jens Axboe Aug. 11, 2023, 2:40 a.m. UTC | #92
On 8/10/23 5:47 PM, Linus Torvalds wrote:
> On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> FWIW I recently fixed all my stupid debian package dependencies so that
>> I could actually install liburing again, and rebuilt fstests.  The very
>> next morning I noticed a number of new test failures in /exactly/ the
>> way that Kent said to expect:
>>
>> fsstress -d /mnt & <sleep then simulate fs crash>; \
>>         umount /mnt; mount /dev/sda /mnt
>>
>> Here, umount exits before the filesystem is really torn down, and then
>> mount fails because it can't get an exclusive lock on the device.
> 
> I agree that that obviously sounds like mount is just returning either
> too early. Or too eagerly.
> 
> But I suspect any delayed fput() issues (whether from aio or io_uring)
> are then just a way to trigger the problem, not the fundamental cause.
> 
> Because even if the fput() is delayed, the mntput() part of that
> delayed __fput action is the one that *should* have kept the
> filesystem mounted until it is no longer busy.
> 
> And more importantly, having some of the common paths synchronize
> *their* fput() calls only affects those paths.
> 
> It doesn't affect the fundamental issue that the last fput() can
> happen in odd contexts when the file descriptor was used for something
> a bit stranger.
> 
> So I do feel like the fput patch I saw looked more like a "hide the
> problem" than a real fix.

The fput patch was not pretty, nor is it needed. What happens on the
io_uring side is that pending requests (which can hold files referenced)
are canceled on exit. But we don't wait for the references to go away,
which then introduces this race.

I've used this to trigger it:

#!/bin/bash

DEV=/dev/nvme0n1
MNT=/data
ITER=0

while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null &
	# let fio run for a random 0.0-2.9s before killing it
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	# kill fio and loop until no fio processes remain
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		killall -9 fio > /dev/null 2>&1
		wait > /dev/null 2>&1
		ps -e | grep "fio " > /dev/null 2>&1
	done
	# a spurious failure here reproduces the race
	sudo umount $MNT
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done

and can make it happen pretty easily, within a few iterations.

Contrary to how it was otherwise presented in this thread, I did take a
look at this a month ago and wrote up some patches for it. Just rebased
them on the current tree:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel

Since we have task_work involved for both the completions and the
__fput(), ordering is a concern which is why it needs a bit more effort
than just the bare bones stuff. The way the task_work list works, we
llist_del_all() and run all items. But we do encapsulate that in
io_uring anyway, so it's possible to run our pending local items and
avoid that particular snag.
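
To illustrate the snag with generic names (this is the shape of the
llist pattern, not the actual io_uring code): task_work-style lists
are drained with llist_del_all(), which atomically takes everything on
the list, so running only our own pending items means keeping them on
a list we control:

struct twork_item {
	struct llist_node	node;
	void			(*fn)(struct twork_item *);
};

static void run_local_items(struct llist_head *list)
{
	/* atomically grabs *all* pending items, newest first */
	struct llist_node *node = llist_del_all(list);

	/* restore submission order before running anything */
	node = llist_reverse_order(node);
	while (node) {
		struct twork_item *item =
			container_of(node, struct twork_item, node);

		node = node->next;
		item->fn(item);
	}
}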

WIP obviously, the first 3-4 prep patches were posted earlier today, but
I'm not happy with the last 3 yet in the above branch. Or at least not
fully confident, so will need a bit more thinking and testing. Does pass
the above test case, and the regular liburing test/regression cases,
though.
Kent Overstreet Aug. 11, 2023, 2:47 a.m. UTC | #93
On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote:
> On Thu 10-08-23 11:54:53, Kent Overstreet wrote:
> > > And there clearly is something very strange going on with superblock
> > > handling
> > 
> > This deserves an explanation because sget() is a bit nutty.
> > 
> > The way sget() is conventionally used for block device filesystems, the
> > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
> > but the holder is the fs type pointer, so it won't exclude with other
> > opens of the same fs type.
> > 
> > That means the only protection from multiple opens scribbling over each
> > other is sget() itself - but if the bdev handle ever outlives the
> > superblock we're completely screwed; that's a silent data corruption bug
> > that we can't easily catch, and if the filesystem teardown path has any
> > asynchronous stuff going on (and of course it does) that's not a hard
> > mistake to make. I've observed at least one bug that looked suspiciously
> > like that, but I don't think I quite pinned it down at the time.
> 
> This is just being changed - check Christian's VFS tree. There are patches
> that make sget() use superblock pointer as a bdev holder so the reuse
> you're speaking about isn't a problem anymore.

So then the question is what do you use for identifying the superblock,
and you're switching to the dev_t - interesting.

Are we 100% sure that will never break, that a dev_t will always
identify a unique block_device? Namespacing has been changing things.

> > It also forces the caller to separate opening of the block devices from
> > the rest of filesystem initialization, which is a bit less than ideal.
> > 
> > Anyways, bcachefs just wants to be able to do real exclusive opens of
> > the block devices, and we do all filesystem bringup with a single
> > bch2_fs_open() call. I think this could be made to work with the way
> > sget() wants to work, but it'd require reworking the locking in
> > sget() - it does everything, including the test() and set() calls, under
> > a single spinlock.
> 
> Yeah. Maybe the current upstream changes aren't enough to make your life
> easier for bcachefs, btrfs does its special thing as well after all because
> mount also involves multiple devices for it. I just wanted to mention that
> the exclusive bdev open thing is changing.

I like the mount_bdev() approach in your patch a _lot_ better than the
old code; I think the approach almost works for multi device
filesystems - at least for bcachefs where we always pass in the full
list of devices we want to open, there's no kernel side probing like in
btrfs.

What changes is we'd have to pass a vector of dev_t's to sget(), and
set() needs to be able to stash them in super_block (not s_fs_info, we
need that for bch_fs later and that doesn't exist yet). But that's a
minor detail.

Yeah, this could work.
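
As a rough sketch of what I mean - the devs vector, the s_bch_devs
field, and bch2_devs_equal() are all made up here for illustration,
and error handling is elided:

struct bch_devs_sketch {
	unsigned	nr;
	dev_t		devs[];		/* full device list, sorted */
};

static int bch2_test_super_sketch(struct super_block *sb, void *data)
{
	/* match on the whole vector of dev_t's that set() stashed */
	return bch2_devs_equal(sb->s_bch_devs, data);
}

static int bch2_set_super_sketch(struct super_block *sb, void *data)
{
	sb->s_bch_devs = data;	/* hypothetical field, not s_fs_info */
	return 0;
}

/* sb = sget(fs_type, bch2_test_super_sketch, bch2_set_super_sketch,
 *	     flags, devs); */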
Kent Overstreet Aug. 11, 2023, 3:45 a.m. UTC | #94
On Thu, Aug 10, 2023 at 03:39:42PM -0700, Darrick J. Wong wrote:
> I've said this previously, and I'll say it again: we're severely
> under-resourced.  Not just XFS, the whole fsdevel community.  As a
> developer and later a maintainer, I've learnt the hard way that a very
> large amount of non-coding work is necessary to build a good
> filesystem.  There's enough not-really-coding work for several people.
> Instead, we lean hard on maintainers to do all that work.  That might've
> worked acceptably for the first 20 years, but it doesn't now.

Yeah, that was my takeaway too when I started doing some more travelling
last fall to talk to people about bcachefs - the teams are not what they
were 10 years ago, and a lot of the effort in the filesystem space feels
a lot more fragmented. It feels like there's a real lack of leadership
or any kind of a long term plan in the filesystem space, and I think
that's one of the causes of all the burnout; we don't have a clear set
of priorities or long term goals.

> Nowadays we have all these people running bots and AIs throwing a steady
> stream of bug reports and CVE reports at Dave [Chinner] and me.  Most of
> these people *do not* help fix the problems they report.  Once in a
> while there's an actual *user* report about data loss, but those
> (thankfully) aren't the majority of the reports.
> 
> However, every one of these reports has to be triaged, analyzed, and
> dealt with.  As soon as we clear one, at least one more rolls in.  You
> know what that means?  Dave and I are both in a permanent state of
> heightened alert, fear, and stress.  We never get to settle back down to
> calm.  Every time someone brings up syzbot, CVEs, or security?  I feel
> my own stress response ramping up.  I can no longer have "rational"
> conversations about syzbot because those discussions push my buttons.
> 
> This is not healthy!

Yeah, we really need to take a step back and ask ourselves what we're
trying to do here.

At this point, I'm not so sure hardening xfs/ext4 in all the ways people
are wanting them to be hardened is a realistic idea: these are huge, old
C codebases that are tricky to work on, and they weren't designed from
the start with these kinds of considerations. Yes, in a perfect world
all code should be secure and all bugs should be fixed, but is this the
way to do it?

Personally, I think we'd be better served by putting what manpower we
can spare into starting on an incremental Rust rewrite; at least that's
my plan for bcachefs, and something I've been studying for a while (as
soon as the gcc rust stuff lands I'll be adding Rust code to
fs/bcachefs, some code already exists). For xfs/ext4, teasing things
apart and figuring out how to restructure data structures in a way to
pass the borrow checker may not be realistic, I don't know the codebases
well enough to say - but clearly the current approach is not working,
and these codebases are almost definitely still going to be in use 50
years from now, we need to be coming up with _some_ sort of plan.

And if we had a coherent long term plan, maybe that would help with the
funding and headcount issues...

> A group dynamic that I keep observing around here is that someone tries
> to introduce some unfamiliar (or even slightly new) concept, because
> they want the kernel to do something it didn't do before.  The author
> sends out patches for review, and some of the reviewers who show up
> sound like they're so afraid of ... something ... that they throw out
> vague arguments that something might break.
> 
> [I have had people tell me in private that while they don't have any
> specific complaints about online fsck, "something" is wrong and I need
> to stop and consider more thoroughly.  Consider /what/?]

Yup, that's just broken. If you're telling someone they're doing it
wrong and you're not offering up any ideas, maybe _you're_ the problem.

The fear based thing is very real, and _very_ understandable. In the
filesystem world, we have to live with our mistakes in a way no one else
in kernel land does. There's no worse feeling than realizing you fucked
up something in the on disk format, and you didn't realize it until six
months later, and now you've got incompatibilities that are a nightmare
to sort out - never mind the more banal "oh fuck, sorry I ate your data"
stories.

> Or, worse, no reviewers show up.  The author merges it, and a month
> later there's a freakout because something somewhere else broke.  Angry
> threads spread around fsdevel because now there's pressure to get it
> fixed before -rc8 (in the good case) or ASAP (because now it's
> released).  Did the author have an incomplete understanding of the code?
> Were there potential reviewers who might've said something but bailed?
> Yes and yes.
> 
> What do we need to reduce the amount of fear and anger around here,
> anyway?  20 years ago when I started my career in Linux I found the work
> to be challenging and enjoyable.  Now I see a lot more anger, and I am
> sad, because there /are/ still enjoyable challenges to be undertaken.
> Can we please have that conversation?

I've been through the burnout cycle too (many times!), and for me the
answer was: slow down, and identify the things that really matter, the
things that will make my life easier in the long run, and focus on
_that_.

I've been through cycles more than once where I wasn't keeping up with
bug reports, and I had to tell my users "hang on - this isn't efficient,
I need to work on the testing automation because stuff is slipping
through; give me a month".

(And also make sure to leave some time for the things I actually do
enjoy; right now that means working on the fuse port here and there).

> People and groups do not do well when they feel like they're under
> constant attack, like they have to brace themselves for whatever
> bullshit is coming next.  That is how I feel most weeks, and I choose
> not to do that anymore.
> 
> > and I _really_
> > hope people are taking notice about Darrick stepping away from XFS and
> > asking themselves what needs to be sorted out.
> 
> Me too.  Ted expressed similar laments about ext4 after I announced my
> intention to reduce my own commitments to XFS.

Oh man, we can't lose Ted.

> > Darrick writes
> > meticulous, well documented code; when I think of people who slip by
> > hacks other people are going to regret later, he's not one of them.
> 
> I appreciate the compliment. ;)
> 
> From what I can tell (because I lolquit and finally had time to start
> scanning the bcachefs code) I really like the thought that you've put
> into indexing and record iteration in the filesystem.  I appreciate the
> amount of work you've put into making it easy and fast to run QA on
> bcachefs, even if we don't quite agree on whether or not I should rip
> and replace my 20yo Debian crazyquilt.

Thanks, the database layer is something I've put a _ton_ of work into. I
feel like we're close to being able to get into some really exciting
stuff once we get past the "stabilizing a new filesystem with a massive
featureset" madness - people have been trying to do the
filesystem-as-a-database thing for years, and I think bcachefs is the
first to actually seriously pull it off.

And I'm really hoping to make the test infrastructure its own real
project for the whole fs community, and more. There's a lot of good
stuff in there I just need to document better and create a proper
website for.

> > And yet, online fsck for XFS has been pushed back repeatedly because
> > of petty bullshit.
> 
> A broader dynamic here is that I ask people to review the code so that I
> can merge it; they say they will do it; and then an entire cycle goes by
> without any visible progress.
> 
> When I ask these people why they didn't follow through on their
> commitments, the responses I hear are pretty uniform -- they got buried
> in root cause analysis of a real bug report but lol there were no other
> senior people available; their time ended up being spent on backports or
> arguing about backports; or they got caught up in that whole freakout
> thing I described above.

Yeah, that set of priorities makes sense when we're talking about
patches that modify existing code; if you can't keep up with bug reports
then you have to slow down on changes, and changes to existing code
often do need the meticulous review - and hopefully while people are
waiting on code review they'll be helping out with bug reports.

But for new code that isn't going to upset existing users, if we trust
the author to not do crazy things then code review is really more about
making sure someone else understands the code. But if they're putting in
all the proper effort to document, to organize things well, to do things
responsibly, does it make sense for that level of code review to be an
up front requirement? Perhaps we could think a _bit_ more about how we
enable people to do good work.

I'm sure the XFS people have thought about this more than I have, but
given how long this has been taking you and the amount of pushback I
feel it ought to be asked.
Kent Overstreet Aug. 11, 2023, 4:03 a.m. UTC | #95
On Thu, Aug 10, 2023 at 04:47:22PM -0700, Linus Torvalds wrote:
> So I might be barking up entirely the wrong tree.

Yeah, I think you are, it sounds like you're describing an entirely
different sort of race.

The issue here is just that killing off a process should release all the
references it holds, and if we kill off all processes accessing a
filesystem we should be able to unmount it - but in this case we can't,
because fput()s are being delayed asynchronously.

delayed_fput() from AIO turned out to not be an issue in my testing, for
reasons that are unclear to me; flush_delayed_fput() certainly isn't
called in any relevant codepaths. The code _looks_ buggy to me, but I
wasn't able to trigger the bug with AIO.

io_uring adds its own layer of indirect asynchronous reference holding,
and that's why the issue crops up there - but io_uring isn't using
delayed_fput() either.

The patch I posted was to make sure the file ref doesn't outlive the
task - I honestly don't know what you and Jens don't like about that
approach (obviously, adding task->ref gets and puts to fastpaths is a
nonstarter, but that's fixable as mentioned).
Linus Torvalds Aug. 11, 2023, 5:20 a.m. UTC | #96
On Thu, 10 Aug 2023 at 21:03, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Thu, Aug 10, 2023 at 04:47:22PM -0700, Linus Torvalds wrote:
> > So I might be barking up entirely the wrong tree.
>
> Yeah, I think you are, it sounds like you're describing an entirely
> different sort of race.

I was just going by Darrick's description of what he saw, which
*seemed* to be that umount had finished with stuff still active:

  "Here, umount exits before the filesystem is really torn down, and then
  mount fails because it can't get an exclusive lock on the device."

But maybe I misunderstood, and the umount wasn't actually successful
(ie "exits" may have been "failed with EBUSY")?

So I was trying to figure out what could cause the behavior I thought
Darrick was describing, which would imply a mnt_count issue.

If it's purely "umount doesn't succeed because the filesystem is still
busy with cleanups", then things are much better.

The mnt_count is nasty; if it's not that, we're actually much better
off, and I'll be very happy to have misunderstood Darrick's case.

              Linus
Kent Overstreet Aug. 11, 2023, 5:29 a.m. UTC | #97
On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote:
> If it's purely "umount doesn't succeed because the filesystem is still
> busy with cleanups", then things are much better.

That's exactly it. We have various tests that kill -9 fio and then
umount, and umount spuriously fails.
Linus Torvalds Aug. 11, 2023, 5:53 a.m. UTC | #98
On Thu, 10 Aug 2023 at 22:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote:
> > If it's purely "umount doesn't succeed because the filesystem is still
> > busy with cleanups", then things are much better.
>
> That's exactly it. We have various tests that kill -9 fio and then
> umount, and umount spuriously fails.

Well, it sounds like Jens already has some handle on at least one
io_uring shutdown case that didn't wait for completion.

At the same time, a random -EBUSY is kind of an expected failure in
real life, since outside of strictly controlled environments you could
easily have just some entirely unrelated thing that just happens to
have looked at the filesystem when you tried to unmount it.

So any real-life use tends to use umount in a (limited) loop. It might
just make sense for the fsstress test scripts to do the same
regardless.

There's no actual good reason to think that -EBUSY is a hard error. It
very much can be transient.
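
Something like this, in userspace terms - the retry count, the sync()
and the sleep are purely illustrative:

#include <errno.h>
#include <unistd.h>
#include <sys/mount.h>

static int umount_retry(const char *mnt, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (umount2(mnt, 0) == 0)
			return 0;
		if (errno != EBUSY)
			return -errno;
		/* transient EBUSY: give delayed work and unrelated
		 * scanners a chance to let go, then try again */
		sync();
		usleep(100 * 1000);
	}
	return -EBUSY;
}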

In fact, I have this horrible flash-back memory to some auto-expiry
scripts that used to do the equivalent of "umount -a -t autofs" every
minute or so as a horrible model for expiring things, happy and secure
in the knowledge that if the filesystem was still in active use, it
would just fail.

So may I suggest that even if the immediate issue ends up being sorted
out, just from a robustness standpoint the "consider EBUSY a hard
error" seems to be a mistake.

Transient failures are pretty much expected, and not all of them are
necessarily kernel-related (ie think "concurrent updatedb run" or any
number of other possibilities).

          Linus
Christian Brauner Aug. 11, 2023, 7:52 a.m. UTC | #99
> So may I suggest that even if the immediate issue ends up being sorted
> out, just from a robustness standpoint the "consider EBUSY a hard
> error" seems to be a mistake.

Especially from umount. The point I was trying to make in the other
thread is that this needs fixing in the subsystem that's causing
_unnecessary_ spurious EBUSY errors, and Jens has been at it right
away. What we don't want is a successful umount being taken to mean
that an immediate mount can never return EBUSY again. I think that's not
a guarantee that umount should give, and with mount namespaces in the
mix you can get your filesystem pinned implicitly somewhere behind your
back without you ever noticing it - to give just one very obvious
example.

> 
> Transient failures are pretty much expected

Yes, I agree.
Jan Kara Aug. 11, 2023, 8:10 a.m. UTC | #100
On Thu 10-08-23 22:47:03, Kent Overstreet wrote:
> On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote:
> > On Thu 10-08-23 11:54:53, Kent Overstreet wrote:
> > > > And there clearly is something very strange going on with superblock
> > > > handling
> > > 
> > > This deserves an explanation because sget() is a bit nutty.
> > > 
> > > The way sget() is conventionally used for block device filesystems, the
> > > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
> > > but the holder is the fs type pointer, so it won't exclude with other
> > > opens of the same fs type.
> > > 
> > > That means the only protection from multiple opens scribbling over each
> > > other is sget() itself - but if the bdev handle ever outlives the
> > > superblock we're completely screwed; that's a silent data corruption bug
> > > that we can't easily catch, and if the filesystem teardown path has any
> > > asynchronous stuff going on (and of course it does) that's not a hard
> > > mistake to make. I've observed at least one bug that looked suspiciously
> > > like that, but I don't think I quite pinned it down at the time.
> > 
> > This is just being changed - check Christian's VFS tree. There are patches
> > that make sget() use superblock pointer as a bdev holder so the reuse
> > you're speaking about isn't a problem anymore.
> 
> So then the question is what do you use for identifying the superblock,
> and you're switching to the dev_t - interesting.
> 
> Are we 100% sure that will never break, that a dev_t will always
> identify a unique block_device? Namespacing has been changing things.

Yes, dev_t is a unique identifier of the device; we rely on that in
multiple places, block device open comes to mind as the first. You're
right that namespacing changes things, but we implement that as changing
what gets presented to userspace via some mapping layer while the kernel
keeps using globally unique identifiers.

								Honza
Christian Brauner Aug. 11, 2023, 8:13 a.m. UTC | #101
On Fri, Aug 11, 2023 at 10:10:42AM +0200, Jan Kara wrote:
> On Thu 10-08-23 22:47:03, Kent Overstreet wrote:
> > On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote:
> > > On Thu 10-08-23 11:54:53, Kent Overstreet wrote:
> > > > > And there clearly is something very strange going on with superblock
> > > > > handling
> > > > 
> > > > This deserves an explanation because sget() is a bit nutty.
> > > > 
> > > > The way sget() is conventionally used for block device filesystems, the
> > > > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
> > > > but the holder is the fs type pointer, so it won't exclude with other
> > > > opens of the same fs type.
> > > > 
> > > > That means the only protection from multiple opens scribbling over each
> > > > other is sget() itself - but if the bdev handle ever outlives the
> > > > superblock we're completely screwed; that's a silent data corruption bug
> > > > that we can't easily catch, and if the filesystem teardown path has any
> > > > asynchronous stuff going on (and of course it does) that's not a hard
> > > > mistake to make. I've observed at least one bug that looked suspiciously
> > > > like that, but I don't think I quite pinned it down at the time.
> > > 
> > > This is just being changed - check Christian's VFS tree. There are patches
> > > that make sget() use superblock pointer as a bdev holder so the reuse
> > > you're speaking about isn't a problem anymore.
> > 
> > So then the question is what do you use for identifying the superblock,
> > and you're switching to the dev_t - interesting.
> > 
> > Are we 100% sure that will never break, that a dev_t will always
> > identify a unique block_device? Namespacing has been changing things.
> 
> Yes, dev_t is a unique identifier of the device; we rely on that in
> multiple places, block device open comes to mind as the first. You're
> right that namespacing changes things, but we implement that as changing
> what gets presented to userspace via some mapping layer while the kernel
> keeps using globally unique identifiers.

Full device namespacing is not on the horizon at all. We looked into
this years ago and it would be a giant effort that would affect nearly
everything if done properly. So even if we went there, there would be so
many changes required that reliance on dev_t in the VFS would be the
least of our problems.
Christian Brauner Aug. 11, 2023, 10:54 a.m. UTC | #102
> I don't want to do that to Christian either, I think highly of the work
> he's been doing and I don't want to be adding to his frustration. So I
> apologize for losing my cool earlier; a lot of that was frustration
> from other threads spilling over.
> 
> But: if he's going to be raising objections, I need to know what his
> concerns are if we're going to get anywhere. Raising objections without
> saying what the concerns are shuts down discussion; I don't think it's
> unreasonable to ask people not to do that, and to try and stay focused
> on the code.

The technical aspects were made clear off-list and I believe multiple
times on-list by now. Any VFS and block related patches are to be
reviewed and accepted before bcachefs gets merged.

This was also clarified off-list before the pull request was sent. Yet,
it was sent anyway.

On the receiving end this feels disrespectful. To other maintainers this
implies you only accept Linus's verdict and expect him to ignore
objections of other maintainers and pull it all in. That would've caused
massive amounts of frustration and conflict should that have happened.
So this whole pull request had massive potential to divide the
community. And in the end you were told the same requirements that we
had already set out, and you accepted them - but that cannot be the
only barrier that you accept.

And it's not just all about code. Especially from a maintainer's
perspective. There are two lengthy mails from Darrick and from you with
detailed excursions about social aspects as well.

Social aspects in fact often take center stage whenever we focus on
code. There will be changes that a sub-maintainer may think are
absolutely required and that infrastructure maintainers will reject for
reasons that the sub-maintainer might fundamentally disagree with and we
need to be confident that a maintainer can handle this gracefully and
respectfully. If there's strong indication to the contrary it's a
problem that can't be ignored.

To address this issue I did request at LSFMM that I want a co-maintainer
for bcachefs that can act as a counterweight and balancing factor. Not
just a reviewer, but someone who is designated to make decisions in
addition to you and can step in. That would be my preferred thing.

Timeline wise, my preference would be if we could get the time to finish
the super work that Christoph and Jan are currently doing and have a
cycle to see how badly the world breaks. And then we aim to merge
bcachefs for v6.7 in November. That's really not far away and also gives
everyone the time to calm down a little.
Kent Overstreet Aug. 11, 2023, 12:58 p.m. UTC | #103
On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote:
> The technical aspects were made clear off-list and I believe multiple
> times on-list by now. Any VFS and block related patches are to be
> reviewed and accepted before bcachefs gets merged.

Christian, you're misrepresenting.

The fact is, the _very same person_ who has been most vocal in saying
"all patches need to go in prior through maintainers" was also in years
past one of the people saying that patches only for bcachefs shouldn't
go in until the bcachefs pull. And as well, we also had Linus just
looking at the prereq series and saying acks would be fine from Jens.

> This was also clarified off-list before the pull request was sent. Yet,
> it was sent anyway.

All these patches have hit the list multiple times; the one VFS patch in
question is a tiny new helper and it's been in your inbox.

> On the receiving end this feels disrespectful. To other maintainers this
> implies you only accept Linus verdict and expect him to ignore
> objections of other maintainers and pull it all in.

Well, it is his kernel :)

And more than that, I find Linus genuinely more pleasant to deal with; I
always feel like I'm talking to someone who's just trying to have an
intelligent conversation and doesn't want to waste time on bullshit.

Look, in the private pre-pull request thread, within _hours_ he was
tearing into six locks and the statistics code.

I posted that same code to the locking mailing list, and I got - what, a
couple comments to clarify? A spelling mistake pointed out?

So yeah, I appreciate hearing from him.

The code's been out on the mailing list for months and you haven't
commented at all. All I need from you is an ack on the dcache helper or
a comment saying why you don't like it, and all I'm getting is
complaints.

> That would've caused massive amounts of frustration and conflict
> should that have happened. So this whole pull request had massive
> potential to divide the community.

Christian, I've been repeatedly asking what your concerns are: we had
_two_ meetings set up for you that you noshow'd on. And here you are
continuing to make wild claims about frustration and conflict, but
you can't seem to name anything specific.

I don't want to make your life more difficult, but you seem to want to
make _mine_ more difficult. You made one offhand comment about not
wanting a repeat of ntfs3, and when I asked you for details you never
even responded.

> Timeline wise, my preference would be if we could get the time to finish
> the super work that Christoph and Jan are currently doing and have a
> cycle to see how badly the world breaks. And then we aim to merge
> bcachefs for v6.7 in November. That's really not far away and also gives
> everyone the time to calm down a little.

I don't see the justification for the delay - every cycle there's some
amount of vfs/block layer refactoring that affects filesystems, the
super work is no different.
Kent Overstreet Aug. 11, 2023, 1:21 p.m. UTC | #104
On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote:
> > I don't want to do that to Christian either, I think highly of the work
> > he's been doing and I don't want to be adding to his frustration. So I
> > apologize for losing my cool earlier; a lot of that was frustration
> > from other threads spilling over.
> > 
> > But: if he's going to be raising objections, I need to know what his
> > concerns are if we're going to get anywhere. Raising objections without
> > saying what the concerns are shuts down discussion; I don't think it's
> > unreasonable to ask people not to do that, and to try and stay focused
> > on the code.
> 
> The technical aspects were made clear off-list and I believe multiple
> times on-list by now. Any VFS and block related patches are to be
> reviewed and accepted before bcachefs gets merged.

Here's the one VFS patch in the series - could we at least get an ack
for this? It's a new helper, just breaks the existing d_tmpfile() up
into two functions - I hope we can at least agree that this patch
shouldn't be controversial?

-->--
Subject: [PATCH] fs: factor out d_mark_tmpfile()

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
its own atomically with other btree updates

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5fdab..dbdafa2617 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct file *file, struct inode *inode)
+void d_mark_tmpfile(struct file *file, struct inode *inode)
 {
 	struct dentry *dentry = file->f_path.dentry;
 
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct file *file, struct inode *inode)
+{
+	struct dentry *dentry = file->f_path.dentry;
+
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(file, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b351e009f..3da2f0545d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct file *, struct inode *);
 extern void d_tmpfile(struct file *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
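
For illustration, a filesystem that manages i_nlink itself would then
do the two remaining steps by hand, minus the link count decrement - a
sketch, not the actual bcachefs code:

	/* i_nlink is already correct, maintained atomically with the
	 * rest of the btree update - so no inode_dec_link_count() */
	d_mark_tmpfile(file, inode);
	d_instantiate(file->f_path.dentry, inode);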
Jens Axboe Aug. 11, 2023, 2:31 p.m. UTC | #105
On 8/10/23 11:53 PM, Linus Torvalds wrote:
> On Thu, 10 Aug 2023 at 22:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>
>> On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote:
>>> If it's purely "umount doesn't succeed because the filesystem is still
>>> busy with cleanups", then things are much better.
>>
>> That's exactly it. We have various tests that kill -9 fio and then
>> umount, and umount spuriously fails.
> 
> Well, it sounds like Jens already has some handle on at least one
> io_uring shutdown case that didn't wait for completion.
> 
> At the same time, a random -EBUSY is kind of an expected failure in
> real life, since outside of strictly controlled environments you could
> easily have just some entirely unrelated thing that just happens to
> have looked at the filesystem when you tried to unmount it.
> 
> So any real-life use tends to use umount in a (limited) loop. It might
> just make sense for the fsstress test scripts to do the same
> regardless.
> 
> There's no actual good reason to think that -EBUSY is a hard error. It
> very much can be transient.

Indeed, any production workload would have some kind of graceful
handling for that. That doesn't mean we should not fix the delayed fput
to avoid it if we can, just that it might make sense to have an xfstest
helper that at least tries X times with a sync in between or something
like that.
Darrick J. Wong Aug. 11, 2023, 10:56 p.m. UTC | #106
On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote:
> On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote:
> > > I don't want to do that to Christian either, I think highly of the work
> > > he's been doing and I don't want to be adding to his frustration. So I
> > > apologize for loosing my cool earlier; a lot of that was frustration
> > > from other threads spilling over.
> > > 
> > > But: if he's going to be raising objections, I need to know what his
> > > concerns are if we're going to get anywhere. Raising objections without
> > > saying what the concerns are shuts down discussion; I don't think it's
> > > unreasonable to ask people not to do that, and to try and stay focused
> > > on the code.
> > 
> > The technical aspects were made clear off-list and I believe multiple
> > times on-list by now. Any VFS and block related patches are to be
> > reviewed and accepted before bcachefs gets merged.
> 
> Here's the one VFS patch in the series - could we at least get an ack
> for this? It's a new helper, just breaks the existing d_tmpfile() up
> into two functions - I hope we can at least agree that this patch
> shouldn't be controversial?
> 
> -->--
> Subject: [PATCH] fs: factor out d_mark_tmpfile()
> 
> New helper for bcachefs - bcachefs doesn't want the
> inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
> its own atomically with other btree updates
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org

Yes, we can finally clean up this braindamage in xfs_generic_create:

	if (tmpfile) {
		/*
		 * The VFS requires that any inode fed to d_tmpfile must
		 * have nlink == 1 so that it can decrement the nlink in
		 * d_tmpfile.  However, we created the temp file with
		 * nlink == 0 because we're not allowed to put an inode
		 * with nlink > 0 on the unlinked list.  Therefore we
		 * have to set nlink to 1 so that d_tmpfile can
		 * immediately set it back to zero.
		 */
		set_nlink(inode, 1);
		d_tmpfile(tmpfile, inode);
	}

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 52e6d5fdab..dbdafa2617 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)
>  
>  EXPORT_SYMBOL(d_genocide);
>  
> -void d_tmpfile(struct file *file, struct inode *inode)
> +void d_mark_tmpfile(struct file *file, struct inode *inode)
>  {
>  	struct dentry *dentry = file->f_path.dentry;
>  
> -	inode_dec_link_count(inode);
>  	BUG_ON(dentry->d_name.name != dentry->d_iname ||
>  		!hlist_unhashed(&dentry->d_u.d_alias) ||
>  		!d_unlinked(dentry));
> @@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
>  				(unsigned long long)inode->i_ino);
>  	spin_unlock(&dentry->d_lock);
>  	spin_unlock(&dentry->d_parent->d_lock);
> +}
> +EXPORT_SYMBOL(d_mark_tmpfile);
> +
> +void d_tmpfile(struct file *file, struct inode *inode)
> +{
> +	struct dentry *dentry = file->f_path.dentry;
> +
> +	inode_dec_link_count(inode);
> +	d_mark_tmpfile(file, inode);
>  	d_instantiate(dentry, inode);
>  }
>  EXPORT_SYMBOL(d_tmpfile);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 6b351e009f..3da2f0545d 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
>  /* <clickety>-<click> the ramfs-type tree */
>  extern void d_genocide(struct dentry *);
>  
> +extern void d_mark_tmpfile(struct file *, struct inode *);
>  extern void d_tmpfile(struct file *, struct inode *);
>  
>  extern struct dentry *d_find_alias(struct inode *);
Christian Brauner Aug. 14, 2023, 7:21 a.m. UTC | #107
On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote:
> On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote:
> > > I don't want to do that to Christian either, I think highly of the work
> > > he's been doing and I don't want to be adding to his frustration. So I
> > > apologize for losing my cool earlier; a lot of that was frustration
> > > from other threads spilling over.
> > > 
> > > But: if he's going to be raising objections, I need to know what his
> > > concerns are if we're going to get anywhere. Raising objections without
> > > saying what the concerns are shuts down discussion; I don't think it's
> > > unreasonable to ask people not to do that, and to try and stay focused
> > > on the code.
> > 
> > The technical aspects were made clear off-list and I believe multiple
> > times on-list by now. Any VFS and block related patches are to be
> > reviewed and accepted before bcachefs gets merged.
> 
> Here's the one VFS patch in the series - could we at least get an ack
> for this? It's a new helper, just breaks the existing d_tmpfile() up
> into two functions - I hope we can at least agree that this patch
> shouldn't be controversial?
> 
> -->--
> Subject: [PATCH] fs: factor out d_mark_tmpfile()
> 
> New helper for bcachefs - bcachefs doesn't want the
> inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
> its own atomically with other btree updates
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org

Yep, that looks good,
Reviewed-by: Christian Brauner <brauner@kernel.org>
Christian Brauner Aug. 14, 2023, 7:25 a.m. UTC | #108
On Fri, Aug 11, 2023 at 08:58:01AM -0400, Kent Overstreet wrote:
> I don't see the justification for the delay - every cycle there's some
> amount of vfs/block layer refactoring that affects filesystems, the
> super work is no different.

So, the reason is that we're very close to having the super code
massaged in a shape where bcachefs should be able to directly make use
of the helpers instead of having to pull in custom code at all. But not
all that work has made it.
Kent Overstreet Aug. 14, 2023, 3:23 p.m. UTC | #109
On Mon, Aug 14, 2023 at 09:25:54AM +0200, Christian Brauner wrote:
> On Fri, Aug 11, 2023 at 08:58:01AM -0400, Kent Overstreet wrote:
> > I don't see the justification for the delay - every cycle there's some
> > amount of vfs/block layer refactoring that affects filesystems, the
> > super work is no different.
> 
> So, the reason is that we're very close to having the super code
> massaged in a shape where bcachefs should be able to directly make use
> of the helpers instead of having to pull in custom code at all. But not
> all that work has made it.

Well, bcachefs really isn't doing anything terribly unusual here; we're
using sget() directly, same as btrfs, and we have to because we're both
multi device filesystems.

Jan's restructuring of mount_bdev() got me thinking that it should be
possible to do a mount_bdevs() that both btrfs and bcachefs could use -
but we don't need to be blocked on that, sget()'s been a normal exported
interface since forever.

Somewhat related, I dropped this patch from my tree:
block: Don't block on s_umount from __invalidate_super()
https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-v6.3&id=1dd488901bc025a61e1ce1a0f54999a2b221bd78

and instead, for now we're closing block devices later in the shutdown
path like other filesystems do (after calling generic_shutdown_super(),
not in put_super()).

But now I've got some test failures, e.g.
https://evilpiepirate.org/~testdashboard/c/040e910f7f316ea6273c895dcc026b9f1ad36a8e/xfstests.generic.604/log.br

and since you guys are switching block device opens to use a real
holder, I suspect you'll be seeing the same issue soon.

The bug is that the mount appears to be gone - generic_shutdown_super()
is finished - so as far as userspace can tell everything is shut down and
we should be able to start using the block device again, but the unmount
path hasn't actually called blkdev_put() yet.

So that patch I posted is one way to solve the self-deadlock from
calling blkdev_put() where we really want to be calling it... not the
prettiest way, but I think this is something we do need to get fixed.
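
The shape of the ordering problem, sketched - bch2_fs_free_sketch() is
a stand-in for whatever drops the block device references, not the
actual bcachefs code:

static void bch2_kill_sb_sketch(struct super_block *sb)
{
	/* tears down the VFS side; from here userspace sees the
	 * mount as gone */
	generic_shutdown_super(sb);

	/* ...but the exclusive device opens are only dropped below,
	 * so a mount attempt racing in between still gets EBUSY on
	 * the device */
	bch2_fs_free_sketch(sb->s_fs_info);
}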
Kent Overstreet Aug. 14, 2023, 3:27 p.m. UTC | #110
On Mon, Aug 14, 2023 at 09:21:22AM +0200, Christian Brauner wrote:
> On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote:
> > On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote:
> > > > I don't want to do that to Christian either, I think highly of the work
> > > > he's been doing and I don't want to be adding to his frustration. So I
> > > > apologize for losing my cool earlier; a lot of that was frustration
> > > > from other threads spilling over.
> > > > 
> > > > But: if he's going to be raising objections, I need to know what his
> > > > concerns are if we're going to get anywhere. Raising objections without
> > > > saying what the concerns are shuts down discussion; I don't think it's
> > > > unreasonable to ask people not to do that, and to try and stay focused
> > > > on the code.
> > > 
> > > The technical aspects were made clear off-list and I believe multiple
> > > times on-list by now. Any VFS and block related patches are to be
> > > reviewed and accepted before bcachefs gets merged.
> > 
> > Here's the one VFS patch in the series - could we at least get an ack
> > for this? It's a new helper, just breaks the existing d_tmpfile() up
> > into two functions - I hope we can at least agree that this patch
> > shouldn't be controversial?
> > 
> > -->--
> > Subject: [PATCH] fs: factor out d_mark_tmpfile()
> > 
> > New helper for bcachefs - bcachefs doesn't want the
> > inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on
> > its own atomically with other btree updates
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: linux-fsdevel@vger.kernel.org
> 
> Yep, that looks good,
> Reviewed-by: Christian Brauner <brauner@kernel.org>

Thanks, much appreciated
Dave Chinner Aug. 21, 2023, 12:09 a.m. UTC | #111
[Been on PTO this last week and a half]

On Thu, Aug 10, 2023 at 11:45:26PM -0400, Kent Overstreet wrote:
> On Thu, Aug 10, 2023 at 03:39:42PM -0700, Darrick J. Wong wrote:
> > Nowadays we have all these people running bots and AIs throwing a steady
> > stream of bug reports and CVE reports at Dave [Chinner] and me.  Most of
> > these people *do not* help fix the problems they report.  Once in a
> > while there's an actual *user* report about data loss, but those
> > (thankfully) aren't the majority of the reports.
> > 
> > However, every one of these reports has to be triaged, analyzed, and
> > dealt with.  As soon as we clear one, at least one more rolls in.  You
> > know what that means?  Dave and I are both in a permanent state of
> > heightened alert, fear, and stress.  We never get to settle back down to
> > calm.  Every time someone brings up syzbot, CVEs, or security?  I feel
> > my own stress response ramping up.  I can no longer have "rational"
> > conversations about syzbot because those discussions push my buttons.
> > 
> > This is not healthy!
> 
> Yeah, we really need to take a step back and ask ourselves what we're
> trying to do here.
> 
> At this point, I'm not so sure hardening xfs/ext4 in all the ways people
> want them to be hardened is a realistic idea: these are huge, old
> C codebases that are tricky to work on, and they weren't designed from
> the start with these kinds of considerations in mind. Yes, in a perfect
> world all code should be secure and all bugs should be fixed, but is
> this the way to do it?

Look at it this way: for XFS we've already done the hardening work -
we started that way back in 2008, when we began planning the V5
filesystem format to avoid all the random bit failures that were
occurring out in the real world and showing up in academic fuzzer
research.

The problem with syzbot has been that it has been testing the old V4
format, and it keeps tripping over different symptoms of the same
problems - problems the V5 format either isn't susceptible to, or
detects and handles by fixing them or shutting down the filesystem.
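
Concretely, the V5 hardening is self-describing metadata: every block
carries a magic number, a CRC, the filesystem UUID and its own
location, all checked by read/write verifiers. A rough sketch of the
idea - invented field names, not our actual xfs_buf_ops code:

#include <linux/crc32c.h>
#include <linux/uuid.h>

struct v5ish_hdr {
	__be32		magic;		/* what structure this claims to be */
	__be32		crc;		/* CRC32c over the rest of the block */
	uuid_t		fs_uuid;	/* which filesystem owns it */
	__be64		blkno;		/* where the block claims to live */
};

static bool v5ish_verify(const void *buf, unsigned int len,
			 u64 want_blkno, const uuid_t *want_uuid,
			 u32 want_magic)
{
	const struct v5ish_hdr *hdr = buf;

	if (be32_to_cpu(hdr->magic) != want_magic)
		return false;	/* wrong structure type, or garbage */
	if (!uuid_equal(&hdr->fs_uuid, want_uuid))
		return false;	/* block from some other filesystem */
	if (be64_to_cpu(hdr->blkno) != want_blkno)
		return false;	/* misdirected write or stale block */

	/* CRC covers everything after the magic and crc fields */
	return be32_to_cpu(hdr->crc) == crc32c(~0U, buf + 8, len - 8);
}

A random bit flip breaks the CRC, a block from an old mkfs fails the
UUID check, and a misdirected write fails the location check - so
instead of wandering off into corrupt structures, the filesystem can
shut down or repair.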

Since syzbot finally turned off V4 format testing on the 3rd of July,
we haven't had a single new syzbot bug report on XFS. I don't expect
syzbot to find a significant number of new issues on XFS from this
point onwards...

So, yeah, I think we did the bulk of the possible format
verification/hardening work in XFS a decade ago, and the stream of
bugs we've been seeing is due to testing that intentionally ignores
the format that actually provides some defences against failures
based on random bit manipulation...

> Personally, I think we'd be better served by putting what manpower we
> can spare into starting on an incremental Rust rewrite; at least that's
> my plan for bcachefs, and something I've been studying for a while (as
> soon as the gcc Rust stuff lands I'll be adding Rust code to
> fs/bcachefs; some code already exists). For xfs/ext4, teasing things
> apart and figuring out how to restructure data structures in a way that
> passes the borrow checker may not be realistic - I don't know the
> codebases well enough to say - but clearly the current approach is not
> working, and these codebases are almost certainly still going to be in
> use 50 years from now, so we need to be coming up with _some_ sort of plan.

For XFS, my plan for the past couple of years has been to start by
rewriting chunks of the userspace code in Rust. Userspace shares the
core libxfs code with the kernel, so the idea is that we slowly
reimplement bits of libxfs in userspace in Rust, where we have the
freedom to just rip and tear the code apart. Then, when we have
something that largely works, we can pull that core libxfs Rust code
back into the kernel as kernel Rust support improves.

Of course, that's been largely put on the back burner over the past
year or so because of all the other demands on my time that stuff
like dealing with 1-2 syzbot bug reports a week has resulted in...

> And if we had a coherent long term plan, maybe that would help with the
> funding and headcount issues...

I don't think a lack of a plan is the problem with funding and
headcount. At its core, the problem is inherent in the capitalist
model that is "funding" the "community" - squeeze the most you can
from as little as possible and externalise the costs as much as
possible. Burning people out is essentially externalising the human
cost of corporate bean-counter optimisation of the bottom line...

If I had a dollar for every time I'd been told "we don't have the
money for more resources" whilst both company revenue and profits
are continually going up, we could pay for another engineer...

[....]

> > A broader dynamic here is that I ask people to review the code so that I
> > can merge it; they say they will do it; and then an entire cycle goes by
> > without any visible progress.
> > 
> > When I ask these people why they didn't follow through on their
> > commitments, the responses I hear are pretty uniform -- they got buried
> > in root cause analysis of a real bug report but lol there were no other
> > senior people available; their time ended up being spent on backports or
> > arguing about backports; or they got caught up in that whole freakout
> > thing I described above.
> 
> Yeah, that set of priorities makes sense when we're talking about
> patches that modify existing code; if you can't keep up with bug reports
> then you have to slow down on changes, and changes to existing code
> often do need the meticulous review - and hopefully while people are
> waiting on code review they'll be helping out with bug reports.
> 
> But for new code that isn't going to upset existing users, if we trust
> the author to not do crazy things then code review is really more about
> making sure someone else understands the code. But if they're putting in
> all the proper effort to document, to organize things well, to do things
> responsibly, does it make sense for that level of code review to be an
> up-front requirement? Perhaps we could think a _bit_ more about how we
> enable people to do good work.

That's pretty much the review rationale I'm using for the online
fsck code. I really only care in detail about how it interfaces with
the core XFS infrastructure, and as long as the rest of it makes
sense and doesn't make my eyes bleed then it's good enough.

That doesn't change the fact that it takes me at least a week to
read through 10,000 lines of code with sufficient rigour to form an
opinion on it, and that's before I know what I need to look at in
more detail.

So however you look at it, even a "good enough" review of 50,000
lines of new code (the size of online fsck) still requires a couple
of months of review time for someone who knows the subsystem
intimately...

> I'm sure the XFS people have thought about this more than I have, but
> given how long this has been taking you and the amount of pushback I
> feel it ought to be asked.

Certainly we have, and for a long time the merge criteria for code
that is tagged as EXPERIMENTAL have been lower (i.e. good enough) -
that's how we merged things like the V5 format, reflink support,
rmap support, etc. without huge amounts of review friction. The
problem is the drive over the past few years for ever more intense
review: CI and bot-driven testing with no-regression policies have
really pushed common sense out the window.

These days it feels like we're only allowed to merge "perfect" code,
otherwise the code is "insecure" (e.g. the policies being advocated
by the syzbot developers). Hence review over the past few years has
become more finicky and picky, out of fear that new code will
introduce regressions. This is a direct result of it being drummed
into developers that regressions and CI failures must be avoided at
-all costs-.

i.e. the policies and testing infrastructure being used to
"validate" these large software projects are pushing us hard towards
the "code must be perfect at first attempt" side of the coin rather
than the more practical (and achievable) "good enough" bar.
CI is useful and good for code quality, but common sense has to
prevail at some point....

-Dave.