Message ID: 20230626214656.hcp4puionmtoloat@moria.home.lan (mailing list archive)
State: New, archived
Series: [GIT,PULL] bcachefs
> (Worth noting the bug causing the most test failures by a wide margin is > actually an io_uring bug that causes random umount failures in shutdown > tests. Would be great to get that looked at, it doesn't just affect > bcachefs). Maybe if you had told someone about that it could get looked at? What is the test case and what is going wrong? > block: Add some exports for bcachefs > block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset > block: Bring back zero_fill_bio_iter > block: Don't block on s_umount from __invalidate_super() OK...
On Mon, Jun 26, 2023 at 05:11:29PM -0600, Jens Axboe wrote: > > (Worth noting the bug causing the most test failures by a wide margin is > > actually an io_uring bug that causes random umount failures in shutdown > > tests. Would be great to get that looked at, it doesn't just affect > > bcachefs). > > Maybe if you had told someone about that it could get looked at? I'm more likely to report bugs to people who have a history of being responsive... > What is the test case and what is going wrong? Example: https://evilpiepirate.org/~testdashboard/c/82973f03c0683f7ecebe14dfaa2c3c9989dd29fc/xfstests.generic.388/log.br I haven't personally seen it on xfs - Darrick knew something about it but he's on vacation. If I track down a reproducer on xfs I'll let you know. If you're wanting to dig into it on bcachefs, ktest is pretty easy to get going: https://evilpiepirate.org/git/ktest.git $ ~/ktest/root_image create # from your kernel tree: $ ~/ktest/build-test-kernel run -ILP ~/ktest/tests/bcachefs/xfstests.ktest/generic/388 I have some debug code I can give you from when I was tracing it through the mount path, I still have to find or recreate the part that tracked it down to io_uring...
On 6/26/23 6:06?PM, Kent Overstreet wrote: > On Mon, Jun 26, 2023 at 05:11:29PM -0600, Jens Axboe wrote: >>> (Worth noting the bug causing the most test failures by a wide margin is >>> actually an io_uring bug that causes random umount failures in shutdown >>> tests. Would be great to get that looked at, it doesn't just affect >>> bcachefs). >> >> Maybe if you had told someone about that it could get looked at? > > I'm more likely to report bugs to people who have a history of being > responsive... I maintain the code I put in the kernel, and generally respond to everything, and most certainly bug reports. >> What is the test case and what is going wrong? > > Example: https://evilpiepirate.org/~testdashboard/c/82973f03c0683f7ecebe14dfaa2c3c9989dd29fc/xfstests.generic.388/log.br > > I haven't personally seen it on xfs - Darrick knew something about it > but he's on vacation. If I track down a reproducer on xfs I'll let you > know. > > If you're wanting to dig into it on bcachefs, ktest is pretty easy to > get going: https://evilpiepirate.org/git/ktest.git > > $ ~/ktest/root_image create > # from your kernel tree: > $ ~/ktest/build-test-kernel run -ILP ~/ktest/tests/bcachefs/xfstests.ktest/generic/388 > > I have some debug code I can give you from when I was tracing it through > the mount path, I still have to find or recreate the part that tracked > it down to io_uring... Doesn't reproduce for me with XFS. The above ktest doesn't work for me either: ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388 realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting and I suspect that should've been a space, but: ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388 Running test xfstests.ktest on m1max at /home/axboe/git/linux-block No tests found TEST FAILED If I just run generic/388 with bcachefs formatted drives, I get xfstests complaining as it tries to mount an XFS file system... As a side note, I do get these when compiling: fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’: fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] 1526 | } | ^ fs/bcachefs/reflink.c: In function ‘bch2_remap_range’: fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] 388 | } | ^
On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > Doesn't reproduce for me with XFS. The above ktest doesn't work for me > either: It just popped for me on xfs, but it took half an hour or so of looping vs. 30 seconds on bcachefs. > ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388 > realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory > Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting > > and I suspect that should've been a space, but: > > ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388 > Running test xfstests.ktest on m1max at /home/axboe/git/linux-block > No tests found > TEST FAILED doh, this is because we just changed it to pick up the list of tests from the test lists that fstests generated. Go into ktest/tests/xfstests and run make and it'll work. (Doesn't matter if make fails due to missing libraries, it'll re-run make inside the VM where the dependencies will all be available). > As a side note, I do get these when compiling: > > fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’: > fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] > 1526 | } > | ^ > fs/bcachefs/reflink.c: In function ‘bch2_remap_range’: > fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] > 388 | } yeah neither of those are super critical because they run top of the stack, but they do need to be addressed. might be time to start heap allocating btree_trans.
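The "heap allocating btree_trans" fix mentioned above is the usual way to get a large object out of a function's stack frame. A minimal sketch of that pattern, using a hypothetical oversized struct rather than the real bcachefs type:

#include <linux/slab.h>

/* Hypothetical stand-in for a struct that is too big to live on the stack. */
struct big_trans {
	char scratch[2048];
};

static int do_something_with_trans(struct big_trans *trans)
{
	/* ... real work would go here ... */
	return 0;
}

static int heap_allocated_variant(void)
{
	/*
	 * Allocate the object instead of declaring it as a local variable,
	 * so the caller's frame stays well under -Wframe-larger-than=.
	 */
	struct big_trans *trans = kzalloc(sizeof(*trans), GFP_KERNEL);
	int ret;

	if (!trans)
		return -ENOMEM;

	ret = do_something_with_trans(trans);

	kfree(trans);
	return ret;
}

The actual bcachefs change would have to thread the allocated trans through its existing call sites; the sketch only shows the allocation pattern.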
On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > fs/bcachefs/alloc_background.c: In function ‘bch2_check_alloc_info’: > fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] > 1526 | } > | ^ > fs/bcachefs/reflink.c: In function ‘bch2_remap_range’: > fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] > 388 | } > | ^ What version of gcc are you using? I'm not seeing either of those warnings - I'm wondering if gcc recently got better about stack usage when inlining. also not seeing any reason why bch2_remap_range's stack frame should be that big, to my eye it looks like it should be more like 1k, so if anyone knows some magic for seeing stack frame layout that would be handy... anyways, there's a patch in my testing branch that should fix bch2_check_alloc_info
On 6/26/23 8:05?PM, Kent Overstreet wrote: > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: >> Doesn't reproduce for me with XFS. The above ktest doesn't work for me >> either: > > It just popped for me on xfs, but it took half an hour or so of looping > vs. 30 seconds on bcachefs. OK, I'll try and leave it running overnight and see if I can get it to trigger. >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388 >> realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory >> Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting >> >> and I suspect that should've been a space, but: >> >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388 >> Running test xfstests.ktest on m1max at /home/axboe/git/linux-block >> No tests found >> TEST FAILED > > doh, this is because we just changed it to pick up the list of tests > from the test lists that fstests generated. > > Go into ktest/tests/xfstests and run make and it'll work. (Doesn't > matter if make fails due to missing libraries, it'll re-run make inside > the VM where the dependencies will all be available). OK, I'll try that as well. BTW, ran into these too. Didn't do anything, it was just a mount and umount trying to get the test going: axboe@m1max-kvm ~/g/k/t/xfstests> sudo cat /sys/kernel/debug/kmemleak unreferenced object 0xffff000201a5e000 (size 1024): comm "bch-copygc/nvme", pid 11362, jiffies 4295015821 (age 6863.776s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 62 aa e8 ee 00 00 00 00 @.......b....... 10 e0 a5 01 02 00 ff ff 10 e0 a5 01 02 00 ff ff ................ backtrace: [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000041cfdde>] __kmalloc_node+0xac/0xd4 [<00000000e1556d66>] kvmalloc_node+0x54/0xe4 [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120 [<000000005d44ce16>] rhashtable_init+0x148/0x1ac [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4 [<00000000ea76e08f>] kthread+0xc4/0xd4 [<0000000068107ad6>] ret_from_fork+0x10/0x20 unreferenced object 0xffff000200eed800 (size 1024): comm "bch-copygc/nvme", pid 13934, jiffies 4295086192 (age 6582.296s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 e8 a5 2a bb 00 00 00 00 @.........*..... 10 d8 ee 00 02 00 ff ff 10 d8 ee 00 02 00 ff ff ................ backtrace: [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000041cfdde>] __kmalloc_node+0xac/0xd4 [<00000000e1556d66>] kvmalloc_node+0x54/0xe4 [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120 [<000000005d44ce16>] rhashtable_init+0x148/0x1ac [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4 [<00000000ea76e08f>] kthread+0xc4/0xd4 [<0000000068107ad6>] ret_from_fork+0x10/0x20
On 6/26/23 8:33?PM, Kent Overstreet wrote: > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: >> fs/bcachefs/alloc_background.c: In function ?bch2_check_alloc_info?: >> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] >> 1526 | } >> | ^ >> fs/bcachefs/reflink.c: In function ?bch2_remap_range?: >> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] >> 388 | } >> | ^ > > What version of gcc are you using? I'm not seeing either of those > warnings - I'm wondering if gcc recently got better about stack usage > when inlining. Using: gcc (Debian 13.1.0-6) 13.1.0 and it's on arm64, fwiw.
On Mon, Jun 26, 2023 at 08:59:13PM -0600, Jens Axboe wrote: > On 6/26/23 8:05?PM, Kent Overstreet wrote: > > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > >> Doesn't reproduce for me with XFS. The above ktest doesn't work for me > >> either: > > > > It just popped for me on xfs, but it took half an hour or so of looping > > vs. 30 seconds on bcachefs. > > OK, I'll try and leave it running overnight and see if I can get it to > trigger. > > >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest/generic/388 > >> realpath: /home/axboe/git/ktest/tests/bcachefs/xfstests.ktest/generic/388: Not a directory > >> Error 1 at /home/axboe/git/ktest/build-test-kernel 262 from: ktest_test=$(realpath "$1"), exiting > >> > >> and I suspect that should've been a space, but: > >> > >> ~/git/ktest/build-test-kernel run -ILP ~/git/ktest/tests/bcachefs/xfstests.ktest generic/388 > >> Running test xfstests.ktest on m1max at /home/axboe/git/linux-block > >> No tests found > >> TEST FAILED > > > > doh, this is because we just changed it to pick up the list of tests > > from the test lists that fstests generated. > > > > Go into ktest/tests/xfstests and run make and it'll work. (Doesn't > > matter if make fails due to missing libraries, it'll re-run make inside > > the VM where the dependencies will all be available). > > OK, I'll try that as well. > > BTW, ran into these too. Didn't do anything, it was just a mount and > umount trying to get the test going: > > axboe@m1max-kvm ~/g/k/t/xfstests> sudo cat /sys/kernel/debug/kmemleak > unreferenced object 0xffff000201a5e000 (size 1024): > comm "bch-copygc/nvme", pid 11362, jiffies 4295015821 (age 6863.776s) > hex dump (first 32 bytes): > 40 00 00 00 00 00 00 00 62 aa e8 ee 00 00 00 00 @.......b....... > 10 e0 a5 01 02 00 ff ff 10 e0 a5 01 02 00 ff ff ................ > backtrace: > [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000041cfdde>] __kmalloc_node+0xac/0xd4 > [<00000000e1556d66>] kvmalloc_node+0x54/0xe4 > [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120 > [<000000005d44ce16>] rhashtable_init+0x148/0x1ac > [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4 > [<00000000ea76e08f>] kthread+0xc4/0xd4 > [<0000000068107ad6>] ret_from_fork+0x10/0x20 > unreferenced object 0xffff000200eed800 (size 1024): > comm "bch-copygc/nvme", pid 13934, jiffies 4295086192 (age 6582.296s) > hex dump (first 32 bytes): > 40 00 00 00 00 00 00 00 e8 a5 2a bb 00 00 00 00 @.........*..... > 10 d8 ee 00 02 00 ff ff 10 d8 ee 00 02 00 ff ff ................ > backtrace: > [<000000002668da56>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<000000006b0b510c>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000041cfdde>] __kmalloc_node+0xac/0xd4 > [<00000000e1556d66>] kvmalloc_node+0x54/0xe4 > [<00000000df620afb>] bucket_table_alloc.isra.0+0x44/0x120 > [<000000005d44ce16>] rhashtable_init+0x148/0x1ac > [<00000000fdca7475>] bch2_copygc_thread+0x50/0x2e4 > [<00000000ea76e08f>] kthread+0xc4/0xd4 > [<0000000068107ad6>] ret_from_fork+0x10/0x20 yup, missing a rhashtable_destroy() call, I'll do some kmemleak testing
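The rhashtable_destroy() fix described above boils down to pairing the init with a destroy on the thread's exit path. A rough sketch under that assumption; the struct and field names here are illustrative, not the actual bcachefs copygc code:

#include <linux/rhashtable.h>

/* Illustrative object and table owner; not the real bcachefs layout. */
struct gc_bucket {
	u64			id;
	struct rhash_head	hash;
};

struct copygc_state {
	struct rhashtable	buckets;
};

static const struct rhashtable_params gc_bucket_params = {
	.key_len	= sizeof(u64),
	.key_offset	= offsetof(struct gc_bucket, id),
	.head_offset	= offsetof(struct gc_bucket, hash),
};

static int copygc_thread_fn(void *arg)
{
	struct copygc_state *s = arg;
	int ret = rhashtable_init(&s->buckets, &gc_bucket_params);

	if (ret)
		return ret;

	/* ... copygc main loop ... */

	/*
	 * The missing piece behind the kmemleak report above: tear the table
	 * down before the thread exits, so the bucket array allocated by
	 * rhashtable_init() is freed.
	 */
	rhashtable_destroy(&s->buckets);
	return 0;
}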
On Mon, Jun 26, 2023 at 08:59:44PM -0600, Jens Axboe wrote: > On 6/26/23 8:33?PM, Kent Overstreet wrote: > > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > >> fs/bcachefs/alloc_background.c: In function ?bch2_check_alloc_info?: > >> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] > >> 1526 | } > >> | ^ > >> fs/bcachefs/reflink.c: In function ?bch2_remap_range?: > >> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] > >> 388 | } > >> | ^ > > > > What version of gcc are you using? I'm not seeing either of those > > warnings - I'm wondering if gcc recently got better about stack usage > > when inlining. > > Using: > > gcc (Debian 13.1.0-6) 13.1.0 > > and it's on arm64, fwiw. OOI what PAGE_SIZE do you have configured? Sometimes fs data structures are PAGE_SIZE dependent (haven't looked at this particular bcachefs data structure). We've also had weirdness with various gcc versions on some architectures making different inlining decisions from x86.
On Tue, Jun 27, 2023 at 04:19:33AM +0100, Matthew Wilcox wrote: > On Mon, Jun 26, 2023 at 08:59:44PM -0600, Jens Axboe wrote: > > On 6/26/23 8:33?PM, Kent Overstreet wrote: > > > On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > > >> fs/bcachefs/alloc_background.c: In function ?bch2_check_alloc_info?: > > >> fs/bcachefs/alloc_background.c:1526:1: warning: the frame size of 2640 bytes is larger than 2048 bytes [-Wframe-larger-than=] > > >> 1526 | } > > >> | ^ > > >> fs/bcachefs/reflink.c: In function ?bch2_remap_range?: > > >> fs/bcachefs/reflink.c:388:1: warning: the frame size of 2352 bytes is larger than 2048 bytes [-Wframe-larger-than=] > > >> 388 | } > > >> | ^ > > > > > > What version of gcc are you using? I'm not seeing either of those > > > warnings - I'm wondering if gcc recently got better about stack usage > > > when inlining. > > > > Using: > > > > gcc (Debian 13.1.0-6) 13.1.0 > > > > and it's on arm64, fwiw. > > OOI what PAGE_SIZE do you have configured? Sometimes fs data structures > are PAGE_SIZE dependent (haven't looked at this particular bcachefs data > structure). We've also had weirdness with various gcc versions on some > architectures making different inlining decisions from x86. There are very few references to PAGE_SIZE in bcachefs, I've killed off as much of that as I can because all this code has to work in userspace too and depending on PAGE_SIZE is sketchy there
Nacked-by: Christoph Hellwig <hch@lst.de>
Kent,
you really need to feed your prep patches through the maintainers
instead of just starting fights everywhere. And you really need
someone other than you to vouch for the code, in the form of a co-maintainer
or reviewer.
On Mon, Jun 26, 2023 at 08:52:39PM -0700, Christoph Hellwig wrote: > > Nacked-by: Christoph Hellwig <hch@lst.de> > > Kent, > > you really need to feed your prep patches through the maintainers > instead of just starting fights everywhere. And you really need > someone else than your vouch for the code in the form of a co-maintainer > or reviewer. If there's a patch that you think is at issue, then I invite you to point it out. Just please deliver your explanation with more technical precision and less vitriol - thanks.
On 6/26/23 8:59 PM, Jens Axboe wrote:
> On 6/26/23 8:05 PM, Kent Overstreet wrote:
>> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote:
>>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me
>>> either:
>>
>> It just popped for me on xfs, but it took half an hour or so of looping
>> vs. 30 seconds on bcachefs.
>
> OK, I'll try and leave it running overnight and see if I can get it to
> trigger.

I did manage to reproduce it, and also managed to get bcachefs to run
the test. But I had to add:

diff --git a/check b/check
index 5f9f1a6bec88..6d74bd4933bd 100755
--- a/check
+++ b/check
@@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do
 	case "$1" in
 	-\? | -h | --help) usage ;;
 
-	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs)
+	-nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs)
 		FSTYP="${1:1}"
 		;;
 	-overlay)

to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
failing because it assumed it was XFS.

I suspected this was just a timing issue, and it looks like that's
exactly what it is. Looking at the test case, it'll randomly kill -9
fsstress, and if that happens while we have io_uring IO pending, then we
process completions inline (for a PF_EXITING current). This means they
get pushed to fallback work, which runs out of line. If we hit that case
AND the timing is such that it hasn't been processed yet, we'll still be
holding a file reference under the mount point and umount will -EBUSY
fail.

As far as I can tell, this can happen with aio as well, it's just harder
to hit. If the fput happens while the task is exiting, then fput will
end up being delayed through a workqueue as well. The test case assumes
that once it's reaped the exit of the killed task that all files are
released, which isn't necessarily true if they are done out-of-line.

For io_uring specifically, it may make sense to wait on the fallback
work. The below patch does this, and should fix the issue. But I'm not
fully convinced that this is really needed, as I do think this can
happen without io_uring as well. It just doesn't right now as the test
does buffered IO, and aio will be fully sync with buffered IO. That
means there's either no gap where aio will hit it without O_DIRECT, or
it's just small enough that it hasn't been hit.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..7abad5cb2131 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -150,7 +150,6 @@ static void io_clean_op(struct io_kiocb *req);
 static void io_queue_sqe(struct io_kiocb *req);
 static void io_move_task_work_from_local(struct io_ring_ctx *ctx);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static __cold void io_fallback_tw(struct io_uring_task *tctx);
 
 struct kmem_cache *req_cachep;
 
@@ -1248,6 +1247,49 @@ static inline struct llist_node *io_llist_cmpxchg(struct llist_head *head,
 	return cmpxchg(&head->first, old, new);
 }
 
+#define NR_FALLBACK_CTX	8
+
+static __cold void io_flush_fallback(struct io_ring_ctx **ctxs, int *nr_ctx)
+{
+	int i;
+
+	for (i = 0; i < *nr_ctx; i++) {
+		struct io_ring_ctx *ctx = ctxs[i];
+
+		flush_delayed_work(&ctx->fallback_work);
+		percpu_ref_put(&ctx->refs);
+	}
+	*nr_ctx = 0;
+}
+
+static __cold void io_flush_fallback_add(struct io_ring_ctx *ctx,
+					 struct io_ring_ctx **ctxs, int *nr_ctx)
+{
+	percpu_ref_get(&ctx->refs);
+	ctxs[(*nr_ctx)++] = ctx;
+	if (*nr_ctx == NR_FALLBACK_CTX)
+		io_flush_fallback(ctxs, nr_ctx);
+}
+
+static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
+{
+	struct llist_node *node = llist_del_all(&tctx->task_list);
+	struct io_ring_ctx *ctxs[NR_FALLBACK_CTX];
+	struct io_kiocb *req;
+	int nr_ctx = 0;
+
+	while (node) {
+		req = container_of(node, struct io_kiocb, io_task_work.node);
+		node = node->next;
+		if (sync)
+			io_flush_fallback_add(req->ctx, ctxs, &nr_ctx);
+		if (llist_add(&req->io_task_work.node,
+			      &req->ctx->fallback_llist))
+			schedule_delayed_work(&req->ctx->fallback_work, 1);
+	}
+	io_flush_fallback(ctxs, &nr_ctx);
+}
+
 void tctx_task_work(struct callback_head *cb)
 {
 	struct io_tw_state ts = {};
@@ -1260,7 +1302,7 @@ void tctx_task_work(struct callback_head *cb)
 	unsigned int count = 0;
 
 	if (unlikely(current->flags & PF_EXITING)) {
-		io_fallback_tw(tctx);
+		io_fallback_tw(tctx, true);
 		return;
 	}
 
@@ -1289,20 +1331,6 @@ void tctx_task_work(struct callback_head *cb)
 	trace_io_uring_task_work_run(tctx, count, loops);
 }
 
-static __cold void io_fallback_tw(struct io_uring_task *tctx)
-{
-	struct llist_node *node = llist_del_all(&tctx->task_list);
-	struct io_kiocb *req;
-
-	while (node) {
-		req = container_of(node, struct io_kiocb, io_task_work.node);
-		node = node->next;
-		if (llist_add(&req->io_task_work.node,
-			      &req->ctx->fallback_llist))
-			schedule_delayed_work(&req->ctx->fallback_work, 1);
-	}
-}
-
 static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote: > On 6/26/23 8:59?PM, Jens Axboe wrote: > > On 6/26/23 8:05?PM, Kent Overstreet wrote: > >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me > >>> either: > >> > >> It just popped for me on xfs, but it took half an hour or so of looping > >> vs. 30 seconds on bcachefs. > > > > OK, I'll try and leave it running overnight and see if I can get it to > > trigger. > > I did manage to reproduce it, and also managed to get bcachefs to run > the test. But I had to add: > > diff --git a/check b/check > index 5f9f1a6bec88..6d74bd4933bd 100755 > --- a/check > +++ b/check > @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do > case "$1" in > -\? | -h | --help) usage ;; > > - -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs) > + -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs) > FSTYP="${1:1}" > ;; > -overlay) I wonder if this is due to an upstream fstests change I haven't seen yet, I'll have a look. > to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept > failing because it assumed it was XFS. > > I suspected this was just a timing issue, and it looks like that's > exactly what it is. Looking at the test case, it'll randomly kill -9 > fsstress, and if that happens while we have io_uring IO pending, then we > process completions inline (for a PF_EXITING current). This means they > get pushed to fallback work, which runs out of line. If we hit that case > AND the timing is such that it hasn't been processed yet, we'll still be > holding a file reference under the mount point and umount will -EBUSY > fail. > > As far as I can tell, this can happen with aio as well, it's just harder > to hit. If the fput happens while the task is exiting, then fput will > end up being delayed through a workqueue as well. The test case assumes > that once it's reaped the exit of the killed task that all files are > released, which isn't necessarily true if they are done out-of-line. Yeah, I traced it through to the delayed fput code as well. I'm not sure delayed fput is responsible here; what I learned when I was tracking this down has mostly fell out of my brain, so take anything I say with a large grain of salt. But I believe I tested with delayed_fput completely disabled, and found another thing in io_uring with the same effect as delayed_fput that wasn't being flushed. > For io_uring specifically, it may make sense to wait on the fallback > work. The below patch does this, and should fix the issue. But I'm not > fully convinced that this is really needed, as I do think this can > happen without io_uring as well. It just doesn't right now as the test > does buffered IO, and aio will be fully sync with buffered IO. That > means there's either no gap where aio will hit it without O_DIRECT, or > it's just small enough that it hasn't been hit. I just tried your patch and I still have generic/388 failing - it might've taken a bit longer to pop this time. I wonder if there might be a better way of solving this though? For aio, when a process is exiting we just synchronously tear down the ioctx, including waiting for outstanding iocbs. delayed_fput, even though I believe not responsible here, seems sketchy to me because there doesn't seem to be a straightforward way to flush delayed fputs for a given _process_ - there's a single global work item, and we can only flush globally. Would what aio does work here? 
(disclaimer: I haven't studied the io_uring code so I haven't figured out the approach your patch is taking yet)
On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote: > On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote: > > On 6/26/23 8:59?PM, Jens Axboe wrote: > > > On 6/26/23 8:05?PM, Kent Overstreet wrote: > > >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > > >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me > > >>> either: > > >> > > >> It just popped for me on xfs, but it took half an hour or so of looping > > >> vs. 30 seconds on bcachefs. > > > > > > OK, I'll try and leave it running overnight and see if I can get it to > > > trigger. > > > > I did manage to reproduce it, and also managed to get bcachefs to run > > the test. But I had to add: > > > > diff --git a/check b/check > > index 5f9f1a6bec88..6d74bd4933bd 100755 > > --- a/check > > +++ b/check > > @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do > > case "$1" in > > -\? | -h | --help) usage ;; > > > > - -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs) > > + -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs) > > FSTYP="${1:1}" > > ;; > > -overlay) > > I wonder if this is due to an upstream fstests change I haven't seen > yet, I'll have a look. Run mkfs.bcachefs on the testdir first. fstests tries to probe the filesystem type to test if $FSTYP is not set. If it doesn't find a filesystem or it is unsupported, it will use the default (i.e. XFS). There should be no reason to need to specify the filesystem type for filesystems that blkid recognises. from common/config: # Autodetect fs type based on what's on $TEST_DEV unless it's been set # externally if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV` fi FSTYP=${FSTYP:=xfs} export FSTYP Cheers, Dave.
On Wed, Jun 28, 2023 at 08:05:06AM +1000, Dave Chinner wrote: > On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote: > > On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote: > > > On 6/26/23 8:59?PM, Jens Axboe wrote: > > > > On 6/26/23 8:05?PM, Kent Overstreet wrote: > > > >> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: > > > >>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me > > > >>> either: > > > >> > > > >> It just popped for me on xfs, but it took half an hour or so of looping > > > >> vs. 30 seconds on bcachefs. > > > > > > > > OK, I'll try and leave it running overnight and see if I can get it to > > > > trigger. > > > > > > I did manage to reproduce it, and also managed to get bcachefs to run > > > the test. But I had to add: > > > > > > diff --git a/check b/check > > > index 5f9f1a6bec88..6d74bd4933bd 100755 > > > --- a/check > > > +++ b/check > > > @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do > > > case "$1" in > > > -\? | -h | --help) usage ;; > > > > > > - -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs) > > > + -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs) > > > FSTYP="${1:1}" > > > ;; > > > -overlay) > > > > I wonder if this is due to an upstream fstests change I haven't seen > > yet, I'll have a look. > > Run mkfs.bcachefs on the testdir first. fstests tries to probe the > filesystem type to test if $FSTYP is not set. If it doesn't find a > filesystem or it is unsupported, it will use the default (i.e. XFS). > There should be no reason to need to specify the filesystem type for > filesystems that blkid recognises. from common/config: Actually ktest already does that, and it sets $FSTYP as well. Jens, are you sure you weren't doing something funny?
On 6/27/23 2:15?PM, Kent Overstreet wrote: >> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept >> failing because it assumed it was XFS. >> >> I suspected this was just a timing issue, and it looks like that's >> exactly what it is. Looking at the test case, it'll randomly kill -9 >> fsstress, and if that happens while we have io_uring IO pending, then we >> process completions inline (for a PF_EXITING current). This means they >> get pushed to fallback work, which runs out of line. If we hit that case >> AND the timing is such that it hasn't been processed yet, we'll still be >> holding a file reference under the mount point and umount will -EBUSY >> fail. >> >> As far as I can tell, this can happen with aio as well, it's just harder >> to hit. If the fput happens while the task is exiting, then fput will >> end up being delayed through a workqueue as well. The test case assumes >> that once it's reaped the exit of the killed task that all files are >> released, which isn't necessarily true if they are done out-of-line. > > Yeah, I traced it through to the delayed fput code as well. > > I'm not sure delayed fput is responsible here; what I learned when I was > tracking this down has mostly fell out of my brain, so take anything I > say with a large grain of salt. But I believe I tested with delayed_fput > completely disabled, and found another thing in io_uring with the same > effect as delayed_fput that wasn't being flushed. I'm not saying it's delayed_fput(), I'm saying it's the delayed putting io_uring can end up doing. But yes, delayed_fput() is another candidate. >> For io_uring specifically, it may make sense to wait on the fallback >> work. The below patch does this, and should fix the issue. But I'm not >> fully convinced that this is really needed, as I do think this can >> happen without io_uring as well. It just doesn't right now as the test >> does buffered IO, and aio will be fully sync with buffered IO. That >> means there's either no gap where aio will hit it without O_DIRECT, or >> it's just small enough that it hasn't been hit. > > I just tried your patch and I still have generic/388 failing - it > might've taken a bit longer to pop this time. Yep see the same here. 
Didn't have time to look into it after sending that email today, just took a quick stab at writing a reproducer and ended up crashing bcachefs: [ 1122.384909] workqueue: Failed to create a rescuer kthread for wq "bcachefs": -EINTR [ 1122.384915] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 1122.385814] Mem abort info: [ 1122.385962] ESR = 0x0000000096000004 [ 1122.386161] EC = 0x25: DABT (current EL), IL = 32 bits [ 1122.386444] SET = 0, FnV = 0 [ 1122.386612] EA = 0, S1PTW = 0 [ 1122.386842] FSC = 0x04: level 0 translation fault [ 1122.387168] Data abort info: [ 1122.387321] ISV = 0, ISS = 0x00000004 [ 1122.387518] CM = 0, WnR = 0 [ 1122.387676] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001133da000 [ 1122.388014] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 1122.388363] Internal error: Oops: 0000000096000004 [#1] SMP [ 1122.388659] Modules linked in: [ 1122.388866] CPU: 4 PID: 23129 Comm: mount Not tainted 6.4.0-02556-ge61c7fc22b68-dirty #3647 [ 1122.389389] Hardware name: linux,dummy-virt (DT) [ 1122.389682] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 1122.390118] pc : bch2_free_pending_node_rewrites+0x40/0x90 [ 1122.390466] lr : bch2_free_pending_node_rewrites+0x28/0x90 [ 1122.390815] sp : ffff80002481b770 [ 1122.391030] x29: ffff80002481b770 x28: ffff0000e1d24000 x27: 00000000fffff7b7 [ 1122.391475] x26: 0000000000000000 x25: ffff0000e1d00040 x24: dead000000000122 [ 1122.391919] x23: dead000000000100 x22: ffff0000e1d031b8 x21: ffff0000e1d00040 [ 1122.392366] x20: 0000000000000000 x19: ffff0000e1d031a8 x18: 0000000000000009 [ 1122.392860] x17: 3a22736665686361 x16: 6362222071772072 x15: 6f66206461657268 [ 1122.393622] x14: 746b207265756373 x13: 52544e49452d203a x12: 0000000000000001 [ 1122.395170] x11: 0000000000000001 x10: 0000000000000000 x9 : 00000000000002d3 [ 1122.396592] x8 : 00000000000003f8 x7 : 0000000000000000 x6 : ffff8000093c2e78 [ 1122.397970] x5 : ffff000209de4240 x4 : ffff0000e1d00030 x3 : dead000000000122 [ 1122.399263] x2 : 00000000000031a8 x1 : 0000000000000000 x0 : 0000000000000000 [ 1122.400473] Call trace: [ 1122.400908] bch2_free_pending_node_rewrites+0x40/0x90 [ 1122.401783] bch2_fs_release+0x48/0x24c [ 1122.402589] kobject_put+0x7c/0xe8 [ 1122.403271] bch2_fs_free+0xa4/0xc8 [ 1122.404033] bch2_fs_alloc+0x5c8/0xbcc [ 1122.404888] bch2_fs_open+0x19c/0x430 [ 1122.405781] bch2_mount+0x194/0x45c [ 1122.406643] legacy_get_tree+0x2c/0x54 [ 1122.407476] vfs_get_tree+0x28/0xd4 [ 1122.408251] path_mount+0x5d0/0x6c8 [ 1122.409056] do_mount+0x80/0xa4 [ 1122.409866] __arm64_sys_mount+0x150/0x168 [ 1122.410904] invoke_syscall.constprop.0+0x70/0xb8 [ 1122.411890] do_el0_svc+0xbc/0xf0 [ 1122.412596] el0_svc+0x74/0x9c [ 1122.413343] el0t_64_sync_handler+0xa8/0x134 [ 1122.414148] el0t_64_sync+0x168/0x16c [ 1122.414863] Code: f2fbd5b7 d2863502 91008af8 8b020273 (f85d8695) [ 1122.415939] ---[ end trace 0000000000000000 ]--- > I wonder if there might be a better way of solving this though? For aio, > when a process is exiting we just synchronously tear down the ioctx, > including waiting for outstanding iocbs. aio is pretty trivial, because the only async it supports is O_DIRECT on regular files which always completes in finite time. io_uring has to cancel etc, so we need to do a lot more. But the concept of my patch should be fine, but I think we must be missing a case. Which is why I started writing a small reproducer instead. I'll pick it up again tomorrow and see what is going on here. 
> delayed_fput, even though I believe not responsible here, seems sketchy > to me because there doesn't seem to be a straightforward way to flush > delayed fputs for a given _process_ - there's a single global work item, > and we can only flush globally. Yep as mentioned I don't think it's delayed_fput at all. And yeah we can only globally flush that, not per task/files_struct.
On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: > On 6/27/23 2:15?PM, Kent Overstreet wrote: > >> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept > >> failing because it assumed it was XFS. > >> > >> I suspected this was just a timing issue, and it looks like that's > >> exactly what it is. Looking at the test case, it'll randomly kill -9 > >> fsstress, and if that happens while we have io_uring IO pending, then we > >> process completions inline (for a PF_EXITING current). This means they > >> get pushed to fallback work, which runs out of line. If we hit that case > >> AND the timing is such that it hasn't been processed yet, we'll still be > >> holding a file reference under the mount point and umount will -EBUSY > >> fail. > >> > >> As far as I can tell, this can happen with aio as well, it's just harder > >> to hit. If the fput happens while the task is exiting, then fput will > >> end up being delayed through a workqueue as well. The test case assumes > >> that once it's reaped the exit of the killed task that all files are > >> released, which isn't necessarily true if they are done out-of-line. > > > > Yeah, I traced it through to the delayed fput code as well. > > > > I'm not sure delayed fput is responsible here; what I learned when I was > > tracking this down has mostly fell out of my brain, so take anything I > > say with a large grain of salt. But I believe I tested with delayed_fput > > completely disabled, and found another thing in io_uring with the same > > effect as delayed_fput that wasn't being flushed. > > I'm not saying it's delayed_fput(), I'm saying it's the delayed putting > io_uring can end up doing. But yes, delayed_fput() is another candidate. Sorry - was just working through my recollections/initial thought process out loud > >> For io_uring specifically, it may make sense to wait on the fallback > >> work. The below patch does this, and should fix the issue. But I'm not > >> fully convinced that this is really needed, as I do think this can > >> happen without io_uring as well. It just doesn't right now as the test > >> does buffered IO, and aio will be fully sync with buffered IO. That > >> means there's either no gap where aio will hit it without O_DIRECT, or > >> it's just small enough that it hasn't been hit. > > > > I just tried your patch and I still have generic/388 failing - it > > might've taken a bit longer to pop this time. > > Yep see the same here. Didn't have time to look into it after sending > that email today, just took a quick stab at writing a reproducer and > ended up crashing bcachefs: You must have hit an error before we finished initializing the filesystem, the list head never got initialized. Patch for that will be in the testing branch momentarily. > > I wonder if there might be a better way of solving this though? For aio, > > when a process is exiting we just synchronously tear down the ioctx, > > including waiting for outstanding iocbs. > > aio is pretty trivial, because the only async it supports is O_DIRECT > on regular files which always completes in finite time. io_uring has to > cancel etc, so we need to do a lot more. ahh yes, buffered IO would complicate things > But the concept of my patch should be fine, but I think we must be > missing a case. Which is why I started writing a small reproducer > instead. I'll pick it up again tomorrow and see what is going on here. Ok. Soon as you've got a patch I'll throw it at my CI, or I can point my CI at your branch if you have one.
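The "list head never got initialized" class of bug comes from a common teardown path that runs even when setup bailed out early. The defensive pattern is to initialize such fields before any step that can fail; a minimal sketch with illustrative names, not the real bch2_fs_alloc() code:

#include <linux/bug.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Illustrative; the real code is bch2_fs_alloc()/bch2_fs_free(). */
struct fs_state {
	struct list_head	pending_node_rewrites;
};

static void fs_state_free(struct fs_state *c)
{
	/* Safe even if setup failed halfway, because the head is valid. */
	WARN_ON(!list_empty(&c->pending_node_rewrites));
	kfree(c);
}

static struct fs_state *fs_state_alloc(void)
{
	struct fs_state *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (!c)
		return NULL;

	/*
	 * Initialize list heads before anything that can fail, so an error
	 * anywhere below can still go through the common teardown path
	 * without walking an uninitialized (zeroed) list.
	 */
	INIT_LIST_HEAD(&c->pending_node_rewrites);

	/* ... later setup steps may fail and call fs_state_free(c) ... */
	return c;
}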
On 6/27/23 4:05?PM, Dave Chinner wrote: > On Tue, Jun 27, 2023 at 04:15:24PM -0400, Kent Overstreet wrote: >> On Tue, Jun 27, 2023 at 11:16:01AM -0600, Jens Axboe wrote: >>> On 6/26/23 8:59?PM, Jens Axboe wrote: >>>> On 6/26/23 8:05?PM, Kent Overstreet wrote: >>>>> On Mon, Jun 26, 2023 at 07:13:54PM -0600, Jens Axboe wrote: >>>>>> Doesn't reproduce for me with XFS. The above ktest doesn't work for me >>>>>> either: >>>>> >>>>> It just popped for me on xfs, but it took half an hour or so of looping >>>>> vs. 30 seconds on bcachefs. >>>> >>>> OK, I'll try and leave it running overnight and see if I can get it to >>>> trigger. >>> >>> I did manage to reproduce it, and also managed to get bcachefs to run >>> the test. But I had to add: >>> >>> diff --git a/check b/check >>> index 5f9f1a6bec88..6d74bd4933bd 100755 >>> --- a/check >>> +++ b/check >>> @@ -283,7 +283,7 @@ while [ $# -gt 0 ]; do >>> case "$1" in >>> -\? | -h | --help) usage ;; >>> >>> - -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs) >>> + -nfs|-afs|-glusterfs|-cifs|-9p|-fuse|-virtiofs|-pvfs2|-tmpfs|-ubifs|-bcachefs) >>> FSTYP="${1:1}" >>> ;; >>> -overlay) >> >> I wonder if this is due to an upstream fstests change I haven't seen >> yet, I'll have a look. > > Run mkfs.bcachefs on the testdir first. fstests tries to probe the > filesystem type to test if $FSTYP is not set. If it doesn't find a > filesystem or it is unsupported, it will use the default (i.e. XFS). I did format both test and scratch first with bcachefs, so guessing something is going wrong with figuring out what filesystem is on the device and then it defaults to XFS. I didn't spend too much time on that bit, figured it was easier to just force bcachefs for my purpose. > There should be no reason to need to specify the filesystem type for > filesystems that blkid recognises. from common/config: > > # Autodetect fs type based on what's on $TEST_DEV unless it's been set > # externally > if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then > FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV` > fi > FSTYP=${FSTYP:=xfs} > export FSTYP Gotcha, yep it's because blkid fails to figure it out.
On 2023-06-28 08:40:27-0600, Jens Axboe wrote: > On 6/27/23 4:05?PM, Dave Chinner wrote: > [..] > > There should be no reason to need to specify the filesystem type for > > filesystems that blkid recognises. from common/config: > > > > # Autodetect fs type based on what's on $TEST_DEV unless it's been set > > # externally > > if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then > > FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV` > > fi > > FSTYP=${FSTYP:=xfs} > > export FSTYP > > Gotcha, yep it's because blkid fails to figure it out. This needs blkid/util-linux version 2.39 which is fairly new. If it doesn't work with that, it's a bug. Thomas
On 6/27/23 10:01?PM, Kent Overstreet wrote: > On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: >> On 6/27/23 2:15?PM, Kent Overstreet wrote: >>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept >>>> failing because it assumed it was XFS. >>>> >>>> I suspected this was just a timing issue, and it looks like that's >>>> exactly what it is. Looking at the test case, it'll randomly kill -9 >>>> fsstress, and if that happens while we have io_uring IO pending, then we >>>> process completions inline (for a PF_EXITING current). This means they >>>> get pushed to fallback work, which runs out of line. If we hit that case >>>> AND the timing is such that it hasn't been processed yet, we'll still be >>>> holding a file reference under the mount point and umount will -EBUSY >>>> fail. >>>> >>>> As far as I can tell, this can happen with aio as well, it's just harder >>>> to hit. If the fput happens while the task is exiting, then fput will >>>> end up being delayed through a workqueue as well. The test case assumes >>>> that once it's reaped the exit of the killed task that all files are >>>> released, which isn't necessarily true if they are done out-of-line. >>> >>> Yeah, I traced it through to the delayed fput code as well. >>> >>> I'm not sure delayed fput is responsible here; what I learned when I was >>> tracking this down has mostly fell out of my brain, so take anything I >>> say with a large grain of salt. But I believe I tested with delayed_fput >>> completely disabled, and found another thing in io_uring with the same >>> effect as delayed_fput that wasn't being flushed. >> >> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting >> io_uring can end up doing. But yes, delayed_fput() is another candidate. > > Sorry - was just working through my recollections/initial thought > process out loud No worries, it might actually be a combination and this is why my io_uring side patch didn't fully resolve it. Wrote a simple reproducer and it seems to reliably trigger it, but is fixed with an flush of the delayed fput list on mount -EBUSY return. Still digging... >>>> For io_uring specifically, it may make sense to wait on the fallback >>>> work. The below patch does this, and should fix the issue. But I'm not >>>> fully convinced that this is really needed, as I do think this can >>>> happen without io_uring as well. It just doesn't right now as the test >>>> does buffered IO, and aio will be fully sync with buffered IO. That >>>> means there's either no gap where aio will hit it without O_DIRECT, or >>>> it's just small enough that it hasn't been hit. >>> >>> I just tried your patch and I still have generic/388 failing - it >>> might've taken a bit longer to pop this time. >> >> Yep see the same here. Didn't have time to look into it after sending >> that email today, just took a quick stab at writing a reproducer and >> ended up crashing bcachefs: > > You must have hit an error before we finished initializing the > filesystem, the list head never got initialized. Patch for that will be > in the testing branch momentarily. I'll pull that in. In testing just now, I hit a few more leaks: unreferenced object 0xffff0000e55cf200 (size 128): comm "mount", pid 723, jiffies 4294899134 (age 85.868s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
backtrace: [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cde48528>] __kmalloc+0xac/0xd4 [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20 [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc [<000000003b8339fd>] bch2_fs_open+0x19c/0x430 [<00000000aef40a23>] bch2_mount+0x194/0x45c [<0000000005e49357>] legacy_get_tree+0x2c/0x54 [<00000000f5813622>] vfs_get_tree+0x28/0xd4 [<00000000ea6972ec>] path_mount+0x5d0/0x6c8 [<00000000468ec307>] do_mount+0x80/0xa4 [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168 [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8 [<000000008f20c487>] do_el0_svc+0xbc/0xf0 [<00000000a1018c2c>] el0_svc+0x74/0x9c [<00000000fc46d579>] el0t_64_sync_handler+0xa8/0x134 unreferenced object 0xffff0000e55cf580 (size 128): comm "mount", pid 723, jiffies 4294899134 (age 85.868s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cde48528>] __kmalloc+0xac/0xd4 [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60 [<000000008ff20762>] register_shrinker+0x14/0x34 [<000000007fa7e36c>] bch2_fs_btree_cache_init+0xf8/0x150 [<000000005135a635>] bch2_fs_alloc+0x7ac/0xbcc [<000000003b8339fd>] bch2_fs_open+0x19c/0x430 [<00000000aef40a23>] bch2_mount+0x194/0x45c [<0000000005e49357>] legacy_get_tree+0x2c/0x54 [<00000000f5813622>] vfs_get_tree+0x28/0xd4 [<00000000ea6972ec>] path_mount+0x5d0/0x6c8 [<00000000468ec307>] do_mount+0x80/0xa4 [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168 [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8 [<000000008f20c487>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff0000e55cf480 (size 128): comm "mount", pid 723, jiffies 4294899134 (age 85.868s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cde48528>] __kmalloc+0xac/0xd4 [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60 [<000000008ff20762>] register_shrinker+0x14/0x34 [<000000003d050c32>] bch2_fs_btree_key_cache_init+0x88/0x90 [<00000000d9f351c0>] bch2_fs_alloc+0x7c0/0xbcc [<000000003b8339fd>] bch2_fs_open+0x19c/0x430 [<00000000aef40a23>] bch2_mount+0x194/0x45c [<0000000005e49357>] legacy_get_tree+0x2c/0x54 [<00000000f5813622>] vfs_get_tree+0x28/0xd4 [<00000000ea6972ec>] path_mount+0x5d0/0x6c8 [<00000000468ec307>] do_mount+0x80/0xa4 [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168 [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8 [<000000008f20c487>] do_el0_svc+0xbc/0xf0 >>> I wonder if there might be a better way of solving this though? For aio, >>> when a process is exiting we just synchronously tear down the ioctx, >>> including waiting for outstanding iocbs. >> >> aio is pretty trivial, because the only async it supports is O_DIRECT >> on regular files which always completes in finite time. io_uring has to >> cancel etc, so we need to do a lot more. > > ahh yes, buffered IO would complicate things > >> But the concept of my patch should be fine, but I think we must be >> missing a case. Which is why I started writing a small reproducer >> instead. I'll pick it up again tomorrow and see what is going on here. > > Ok. 
Soon as you've got a patch I'll throw it at my CI, or I can point my > CI at your branch if you have one. I should have something later today, don't feel like I fully understand all of it just yet.
On 6/28/23 8:48 AM, Thomas Weißschuh wrote: > On 2023-06-28 08:40:27-0600, Jens Axboe wrote: >> On 6/27/23 4:05?PM, Dave Chinner wrote: >> [..] > >>> There should be no reason to need to specify the filesystem type for >>> filesystems that blkid recognises. from common/config: >>> >>> # Autodetect fs type based on what's on $TEST_DEV unless it's been set >>> # externally >>> if [ -z "$FSTYP" ] && [ ! -z "$TEST_DEV" ]; then >>> FSTYP=`blkid -c /dev/null -s TYPE -o value $TEST_DEV` >>> fi >>> FSTYP=${FSTYP:=xfs} >>> export FSTYP >> >> Gotcha, yep it's because blkid fails to figure it out. > > This needs blkid/util-linux version 2.39 which is fairly new. > If it doesn't work with that, it's a bug. Got it, looks like I have 2.38.1 here.
On 6/28/23 8:58?AM, Jens Axboe wrote: > I should have something later today, don't feel like I fully understand > all of it just yet. Might indeed be delayed_fput, just the flush is a bit broken in that it races with the worker doing the flush. In any case, with testing that, I hit this before I got an umount failure on loop 6 of generic/388: External UUID: 724c7f1e-fed4-46e8-888a-2d5b170365b7 Internal UUID: 4c356134-e573-4aa4-a7b6-c22ab260e0ff Device index: 0 Label: Version: snapshot_trees Oldest version on disk: snapshot_trees Created: Wed Jun 28 09:16:47 2023 Sequence number: 0 Superblock size: 816 Clean: 0 Devices: 1 Sections: members Features: new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes Compat features: Options: block_size: 512 B btree_node_size: 256 KiB errors: continue [ro] panic metadata_replicas: 1 data_replicas: 1 metadata_replicas_required: 1 data_replicas_required: 1 encoded_extent_max: 64.0 KiB metadata_checksum: none [crc32c] crc64 xxhash data_checksum: none [crc32c] crc64 xxhash compression: [none] lz4 gzip zstd background_compression: [none] lz4 gzip zstd str_hash: crc32c crc64 [siphash] metadata_target: none foreground_target: none background_target: none promote_target: none erasure_code: 0 inodes_32bit: 1 shard_inode_numbers: 1 inodes_use_key_cache: 1 gc_reserve_percent: 8 gc_reserve_bytes: 0 B root_reserve_percent: 0 wide_macs: 0 acl: 1 usrquota: 0 grpquota: 0 prjquota: 0 journal_flush_delay: 1000 journal_flush_disabled: 0 journal_reclaim_delay: 100 journal_transaction_names: 1 nocow: 0 members (size 64): Device: 0 UUID: dea79b51-ed22-4f11-9cb9-2117240419df Size: 20.0 GiB Bucket size: 256 KiB First bucket: 0 Buckets: 81920 Last mount: (never) State: rw Label: (none) Data allowed: journal,btree,user Has data: (none) Discard: 0 Freespace initialized: 0 initializing new filesystem going read-write initializing freespace mounted version=snapshot_trees seed = 1687442369 seed = 1687347478 seed = 1687934778 seed = 1687706987 seed = 1687173946 seed = 1687488122 seed = 1687828133 seed = 1687316163 seed = 1687511704 seed = 1687772088 seed = 1688057713 seed = 1687321139 seed = 1687166901 seed = 1687602318 seed = 1687659981 seed = 1687457702 seed = 1688000542 seed = 1687221947 seed = 1687740111 seed = 1688083754 seed = 1687314115 seed = 1687189436 seed = 1687664679 seed = 1687631074 seed = 1687691080 seed = 1688089920 seed = 1687962494 seed = 1687646206 seed = 1687636790 seed = 1687442248 seed = 1687532669 seed = 1687436103 seed = 1687626640 seed = 1687594091 seed = 1687235023 seed = 1687525509 seed = 1687766818 seed = 1688040782 seed = 1687293628 seed = 1687468804 seed = 1688129968 seed = 1687176698 seed = 1687603782 seed = 1687642709 seed = 1687844382 seed = 1687696290 seed = 1688169221 _check_generic_filesystem: filesystem on /dev/nvme0n1 is inconsistent *** fsck.bcachefs output *** fsck from util-linux 2.38.1 recovering from clean shutdown, journal seq 14642 journal read done, replaying entries 14642-14642 checking allocations starting journal replay, 0 keys checking need_discard and freespace btrees checking lrus checking backpointers to alloc keys checking backpointers to extents backpointer for missing extent u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing checking extents to backpointers checking alloc to lru refs starting fsck going read-write mounted 
version=snapshot_trees opts=degraded,fsck,fix_errors,nochanges 0xaaaafeb6b580g: still has errors *** end fsck.bcachefs output *** mount output *** /dev/vda2 on / type ext4 (rw,relatime,errors=remount-ro) devtmpfs on /dev type devtmpfs (rw,relatime,size=8174296k,nr_inodes=2043574,mode=755) proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime) securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime) tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000) tmpfs on /run type tmpfs (rw,nosuid,nodev,size=3276876k,nr_inodes=819200,mode=755) tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k) cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot) pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime) hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime) debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime) tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime) configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime) fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime) ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700) ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700) ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700) /dev/vda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro) ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700) *** end mount output
On 6/28/23 8:58?AM, Jens Axboe wrote: > On 6/27/23 10:01?PM, Kent Overstreet wrote: >> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: >>> On 6/27/23 2:15?PM, Kent Overstreet wrote: >>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept >>>>> failing because it assumed it was XFS. >>>>> >>>>> I suspected this was just a timing issue, and it looks like that's >>>>> exactly what it is. Looking at the test case, it'll randomly kill -9 >>>>> fsstress, and if that happens while we have io_uring IO pending, then we >>>>> process completions inline (for a PF_EXITING current). This means they >>>>> get pushed to fallback work, which runs out of line. If we hit that case >>>>> AND the timing is such that it hasn't been processed yet, we'll still be >>>>> holding a file reference under the mount point and umount will -EBUSY >>>>> fail. >>>>> >>>>> As far as I can tell, this can happen with aio as well, it's just harder >>>>> to hit. If the fput happens while the task is exiting, then fput will >>>>> end up being delayed through a workqueue as well. The test case assumes >>>>> that once it's reaped the exit of the killed task that all files are >>>>> released, which isn't necessarily true if they are done out-of-line. >>>> >>>> Yeah, I traced it through to the delayed fput code as well. >>>> >>>> I'm not sure delayed fput is responsible here; what I learned when I was >>>> tracking this down has mostly fell out of my brain, so take anything I >>>> say with a large grain of salt. But I believe I tested with delayed_fput >>>> completely disabled, and found another thing in io_uring with the same >>>> effect as delayed_fput that wasn't being flushed. >>> >>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting >>> io_uring can end up doing. But yes, delayed_fput() is another candidate. >> >> Sorry - was just working through my recollections/initial thought >> process out loud > > No worries, it might actually be a combination and this is why my > io_uring side patch didn't fully resolve it. Wrote a simple reproducer > and it seems to reliably trigger it, but is fixed with an flush of the > delayed fput list on mount -EBUSY return. Still digging... I discussed this with Christian offline. I have a patch that is pretty simple, but it does mean that you'd wait for delayed fput flush off umount. Which seems kind of iffy. I think we need to back up a bit and consider if the kill && umount really is sane. If you kill a task that has open files, then any fput from that task will end up being delayed. This means that the umount may very well fail. It'd be handy if we could have umount wait for that to finish, but I'm not at all confident this is a sane solution for all cases. And as discussed, we have no way to even identify which files we'd need to flush out of the delayed list. Maybe the test case just needs fixing? Christian suggested lazy/detach umount and wait for sb release. There's an fsnotify hook for that, fsnotify_sb_delete(). Obviously this is a bit more involved, but seems to me that this would be the way to make it more reliable when killing of tasks with open files are involved.
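The "wait for delayed fput flush off umount" idea discussed above would look roughly like the following. This is only an illustration of the concept, not the patch Jens refers to, and the retry helper and its callback are hypothetical:

#include <linux/errno.h>
#include <linux/file.h>
#include <linux/mount.h>

/*
 * Hypothetical retry helper: if unmounting fails with -EBUSY, drain the
 * global delayed-fput list (which may still hold references to files that
 * were open under this mount when their owner was killed) and try again.
 * There is no per-task variant of this flush, so it is a global wait.
 */
static int umount_retry_after_fput_flush(struct vfsmount *mnt, int flags,
					 int (*try_umount)(struct vfsmount *, int))
{
	int ret = try_umount(mnt, flags);

	if (ret == -EBUSY) {
		flush_delayed_fput();
		ret = try_umount(mnt, flags);
	}
	return ret;
}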
On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote: > On 6/28/23 8:58?AM, Jens Axboe wrote: > > On 6/27/23 10:01?PM, Kent Overstreet wrote: > >> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: > >>> On 6/27/23 2:15?PM, Kent Overstreet wrote: > >>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept > >>>>> failing because it assumed it was XFS. > >>>>> > >>>>> I suspected this was just a timing issue, and it looks like that's > >>>>> exactly what it is. Looking at the test case, it'll randomly kill -9 > >>>>> fsstress, and if that happens while we have io_uring IO pending, then we > >>>>> process completions inline (for a PF_EXITING current). This means they > >>>>> get pushed to fallback work, which runs out of line. If we hit that case > >>>>> AND the timing is such that it hasn't been processed yet, we'll still be > >>>>> holding a file reference under the mount point and umount will -EBUSY > >>>>> fail. > >>>>> > >>>>> As far as I can tell, this can happen with aio as well, it's just harder > >>>>> to hit. If the fput happens while the task is exiting, then fput will > >>>>> end up being delayed through a workqueue as well. The test case assumes > >>>>> that once it's reaped the exit of the killed task that all files are > >>>>> released, which isn't necessarily true if they are done out-of-line. > >>>> > >>>> Yeah, I traced it through to the delayed fput code as well. > >>>> > >>>> I'm not sure delayed fput is responsible here; what I learned when I was > >>>> tracking this down has mostly fell out of my brain, so take anything I > >>>> say with a large grain of salt. But I believe I tested with delayed_fput > >>>> completely disabled, and found another thing in io_uring with the same > >>>> effect as delayed_fput that wasn't being flushed. > >>> > >>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting > >>> io_uring can end up doing. But yes, delayed_fput() is another candidate. > >> > >> Sorry - was just working through my recollections/initial thought > >> process out loud > > > > No worries, it might actually be a combination and this is why my > > io_uring side patch didn't fully resolve it. Wrote a simple reproducer > > and it seems to reliably trigger it, but is fixed with an flush of the > > delayed fput list on mount -EBUSY return. Still digging... > > I discussed this with Christian offline. I have a patch that is pretty > simple, but it does mean that you'd wait for delayed fput flush off > umount. Which seems kind of iffy. > > I think we need to back up a bit and consider if the kill && umount > really is sane. If you kill a task that has open files, then any fput > from that task will end up being delayed. This means that the umount may > very well fail. That's why we have MNT_DETACH: umount2("/mnt", MNT_DETACH) will succeed even if fds are still open. > > It'd be handy if we could have umount wait for that to finish, but I'm > not at all confident this is a sane solution for all cases. And as > discussed, we have no way to even identify which files we'd need to > flush out of the delayed list. > > Maybe the test case just needs fixing? Christian suggested lazy/detach > umount and wait for sb release. There's an fsnotify hook for that, > fsnotify_sb_delete(). Obviously this is a bit more involved, but seems > to me that this would be the way to make it more reliable when killing > of tasks with open files are involved. You can wait on superblock destruction today in multiple ways. 
Roughly from the shell this should work:

root@wittgenstein:~# cat sb_wait.sh
#! /bin/bash

echo "WARNING WARNING: I SUCK AT SHELL SCRIPTS"

echo "mount fs"
sudo mount -t tmpfs tmpfs /mnt
touch /mnt/bla

echo "pin sb by random file for 5s"
sleep 5 > /mnt/bla &

echo "establish inotify watch for sb destruction"
inotifywait -e unmount /mnt &
pid=$!

echo "regular umount - will fail..."
umount /mnt
findmnt | grep "/mnt"

echo "lazily umount - will succeed"
umount -l /mnt
findmnt | grep "/mnt"

echo "and now we wait"
wait $!
echo "done"

Can also use a tiny bpf lsm, or fanotify in the future as we have plans for that.
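(For reference, the "umount -l" / lazy-detach step in the script above corresponds to the umount2() syscall with MNT_DETACH. A minimal C sketch, assuming a mount point at /mnt; illustrative only, not code from this thread:)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Detach /mnt from the mount namespace immediately. The call
	 * succeeds even while files on the filesystem are still open;
	 * the superblock is only torn down once the last file reference
	 * (including any delayed fput) has been dropped.
	 */
	if (umount2("/mnt", MNT_DETACH) == -1) {
		perror("umount2");
		return 1;
	}
	return 0;
}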
On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote: > I discussed this with Christian offline. I have a patch that is pretty > simple, but it does mean that you'd wait for delayed fput flush off > umount. Which seems kind of iffy. > > I think we need to back up a bit and consider if the kill && umount > really is sane. If you kill a task that has open files, then any fput > from that task will end up being delayed. This means that the umount may > very well fail. > > It'd be handy if we could have umount wait for that to finish, but I'm > not at all confident this is a sane solution for all cases. And as > discussed, we have no way to even identify which files we'd need to > flush out of the delayed list. > > Maybe the test case just needs fixing? Christian suggested lazy/detach > umount and wait for sb release. There's an fsnotify hook for that, > fsnotify_sb_delete(). Obviously this is a bit more involved, but seems > to me that this would be the way to make it more reliable when killing > of tasks with open files are involved. No, this is a real breakage. Any time we introduce unexpected asynchrony there's the potential for breakage: case in point, there was a filesystem that made rm asynchronous, then there were scripts out there that deleted until df showed under some threshold.. whoops... this would break anyone that does fuser; umount; and making the umount lazy just moves the race to the next thing that uses the block device. I'd like to know how delayed_fput() avoids this.
On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote: > On 6/27/23 10:01?PM, Kent Overstreet wrote: > > On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: > >> On 6/27/23 2:15?PM, Kent Overstreet wrote: > >>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept > >>>> failing because it assumed it was XFS. > >>>> > >>>> I suspected this was just a timing issue, and it looks like that's > >>>> exactly what it is. Looking at the test case, it'll randomly kill -9 > >>>> fsstress, and if that happens while we have io_uring IO pending, then we > >>>> process completions inline (for a PF_EXITING current). This means they > >>>> get pushed to fallback work, which runs out of line. If we hit that case > >>>> AND the timing is such that it hasn't been processed yet, we'll still be > >>>> holding a file reference under the mount point and umount will -EBUSY > >>>> fail. > >>>> > >>>> As far as I can tell, this can happen with aio as well, it's just harder > >>>> to hit. If the fput happens while the task is exiting, then fput will > >>>> end up being delayed through a workqueue as well. The test case assumes > >>>> that once it's reaped the exit of the killed task that all files are > >>>> released, which isn't necessarily true if they are done out-of-line. > >>> > >>> Yeah, I traced it through to the delayed fput code as well. > >>> > >>> I'm not sure delayed fput is responsible here; what I learned when I was > >>> tracking this down has mostly fell out of my brain, so take anything I > >>> say with a large grain of salt. But I believe I tested with delayed_fput > >>> completely disabled, and found another thing in io_uring with the same > >>> effect as delayed_fput that wasn't being flushed. > >> > >> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting > >> io_uring can end up doing. But yes, delayed_fput() is another candidate. > > > > Sorry - was just working through my recollections/initial thought > > process out loud > > No worries, it might actually be a combination and this is why my > io_uring side patch didn't fully resolve it. Wrote a simple reproducer > and it seems to reliably trigger it, but is fixed with an flush of the > delayed fput list on mount -EBUSY return. Still digging... > > >>>> For io_uring specifically, it may make sense to wait on the fallback > >>>> work. The below patch does this, and should fix the issue. But I'm not > >>>> fully convinced that this is really needed, as I do think this can > >>>> happen without io_uring as well. It just doesn't right now as the test > >>>> does buffered IO, and aio will be fully sync with buffered IO. That > >>>> means there's either no gap where aio will hit it without O_DIRECT, or > >>>> it's just small enough that it hasn't been hit. > >>> > >>> I just tried your patch and I still have generic/388 failing - it > >>> might've taken a bit longer to pop this time. > >> > >> Yep see the same here. Didn't have time to look into it after sending > >> that email today, just took a quick stab at writing a reproducer and > >> ended up crashing bcachefs: > > > > You must have hit an error before we finished initializing the > > filesystem, the list head never got initialized. Patch for that will be > > in the testing branch momentarily. > > I'll pull that in. 
In testing just now, I hit a few more leaks: > > unreferenced object 0xffff0000e55cf200 (size 128): > comm "mount", pid 723, jiffies 4294899134 (age 85.868s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cde48528>] __kmalloc+0xac/0xd4 > [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20 > [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc Can you faddr2line this? I just did a bunch of kmemleak testing and didn't see it. > unreferenced object 0xffff0000e55cf480 (size 128): > comm "mount", pid 723, jiffies 4294899134 (age 85.868s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cde48528>] __kmalloc+0xac/0xd4 > [<0000000097f806f1>] __prealloc_shrinker+0x3c/0x60 > [<000000008ff20762>] register_shrinker+0x14/0x34 > [<000000003d050c32>] bch2_fs_btree_key_cache_init+0x88/0x90 > [<00000000d9f351c0>] bch2_fs_alloc+0x7c0/0xbcc > [<000000003b8339fd>] bch2_fs_open+0x19c/0x430 > [<00000000aef40a23>] bch2_mount+0x194/0x45c > [<0000000005e49357>] legacy_get_tree+0x2c/0x54 > [<00000000f5813622>] vfs_get_tree+0x28/0xd4 > [<00000000ea6972ec>] path_mount+0x5d0/0x6c8 > [<00000000468ec307>] do_mount+0x80/0xa4 > [<00000000ea5d305d>] __arm64_sys_mount+0x150/0x168 > [<00000000da6d98cb>] invoke_syscall.constprop.0+0x70/0xb8 > [<000000008f20c487>] do_el0_svc+0xbc/0xf0 This one is actually a bug in unregister_shrinker(), I have a patch I'll have to send to Andrew.
On Wed, Jun 28, 2023 at 09:22:15AM -0600, Jens Axboe wrote: > On 6/28/23 8:58?AM, Jens Axboe wrote: > > I should have something later today, don't feel like I fully understand > > all of it just yet. > > Might indeed be delayed_fput, just the flush is a bit broken in that it > races with the worker doing the flush. In any case, with testing that, I > hit this before I got an umount failure on loop 6 of generic/388: > > fsck from util-linux 2.38.1 > recovering from clean shutdown, journal seq 14642 > journal read done, replaying entries 14642-14642 > checking allocations > starting journal replay, 0 keys > checking need_discard and freespace btrees > checking lrus > checking backpointers to alloc keys > checking backpointers to extents > backpointer for missing extent > u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing Known bug, but it's gotten difficult to reproduce - if generic/388 ends up being a better reproducer for this that'll be nice
On 6/28/23 11:52?AM, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote: >> I discussed this with Christian offline. I have a patch that is pretty >> simple, but it does mean that you'd wait for delayed fput flush off >> umount. Which seems kind of iffy. >> >> I think we need to back up a bit and consider if the kill && umount >> really is sane. If you kill a task that has open files, then any fput >> from that task will end up being delayed. This means that the umount may >> very well fail. >> >> It'd be handy if we could have umount wait for that to finish, but I'm >> not at all confident this is a sane solution for all cases. And as >> discussed, we have no way to even identify which files we'd need to >> flush out of the delayed list. >> >> Maybe the test case just needs fixing? Christian suggested lazy/detach >> umount and wait for sb release. There's an fsnotify hook for that, >> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems >> to me that this would be the way to make it more reliable when killing >> of tasks with open files are involved. > > No, this is a real breakage. Any time we introduce unexpected > asynchrony there's the potential for breakage: case in point, there was > a filesystem that made rm asynchronous, then there were scripts out > there that deleted until df showed under some threshold.. whoops... This is nothing new - any fput done from an exiting task will end up being deferred. The window may be a bit wider now or a bit different, but it's the same window. If an application assumes it can kill && wait on a task and be guaranteed that the files are released as soon as wait returns, it is mistaken. That is NOT the case. > this would break anyone that does fuser; umount; and making the umount > lazy just moves the race to the next thing that uses the block device. > > I'd like to know how delayed_fput() avoids this. What is "this" here? The delayed fput list is processed async, so it's really down to timing for the size of the window. Either the 388 test is fixed so that it monitors for sb release like Christian described, or we can paper around it with a sync and sleep or something like that. The former would obviously be a lot more reliable.
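(Monitoring for sb release can also be done programmatically rather than with inotifywait: inotify delivers an IN_UNMOUNT event when the superblock is finally shut down, via the fsnotify unmount path mentioned above. A minimal C sketch, assuming the watch is created on /mnt while the filesystem is still mounted and the mount is then detached elsewhere with umount2(..., MNT_DETACH); illustrative only, not code from this thread:)

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	char buf[4096];
	int fd = inotify_init1(0);

	if (fd < 0 || inotify_add_watch(fd, "/mnt", IN_UNMOUNT) < 0) {
		perror("inotify");
		return 1;
	}
	/*
	 * Blocks until the filesystem backing /mnt is actually torn
	 * down, i.e. after all remaining file references (including
	 * delayed fputs from killed tasks) have been dropped.
	 */
	if (read(fd, buf, sizeof(buf)) < 0)
		perror("read");
	return 0;
}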
On 6/28/23 11:56?AM, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 09:22:15AM -0600, Jens Axboe wrote: >> On 6/28/23 8:58?AM, Jens Axboe wrote: >>> I should have something later today, don't feel like I fully understand >>> all of it just yet. >> >> Might indeed be delayed_fput, just the flush is a bit broken in that it >> races with the worker doing the flush. In any case, with testing that, I >> hit this before I got an umount failure on loop 6 of generic/388: >> >> fsck from util-linux 2.38.1 >> recovering from clean shutdown, journal seq 14642 >> journal read done, replaying entries 14642-14642 >> checking allocations >> starting journal replay, 0 keys >> checking need_discard and freespace btrees >> checking lrus >> checking backpointers to alloc keys >> checking backpointers to extents >> backpointer for missing extent >> u64s 9 type backpointer 0:7950303232:0 len 0 ver 0: bucket=0:15164:0 btree=extents l=0 offset=0:0 len=88 pos=1342182431:5745:U32_MAX, not fixing > > Known bug, but it's gotten difficult to reproduce - if generic/388 ends > up being a better reproducer for this that'll be nice Seems to reproduce in anywhere from 1..4 iterations for me.
On 6/28/23 11:54?AM, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote: >> On 6/27/23 10:01?PM, Kent Overstreet wrote: >>> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: >>>> On 6/27/23 2:15?PM, Kent Overstreet wrote: >>>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept >>>>>> failing because it assumed it was XFS. >>>>>> >>>>>> I suspected this was just a timing issue, and it looks like that's >>>>>> exactly what it is. Looking at the test case, it'll randomly kill -9 >>>>>> fsstress, and if that happens while we have io_uring IO pending, then we >>>>>> process completions inline (for a PF_EXITING current). This means they >>>>>> get pushed to fallback work, which runs out of line. If we hit that case >>>>>> AND the timing is such that it hasn't been processed yet, we'll still be >>>>>> holding a file reference under the mount point and umount will -EBUSY >>>>>> fail. >>>>>> >>>>>> As far as I can tell, this can happen with aio as well, it's just harder >>>>>> to hit. If the fput happens while the task is exiting, then fput will >>>>>> end up being delayed through a workqueue as well. The test case assumes >>>>>> that once it's reaped the exit of the killed task that all files are >>>>>> released, which isn't necessarily true if they are done out-of-line. >>>>> >>>>> Yeah, I traced it through to the delayed fput code as well. >>>>> >>>>> I'm not sure delayed fput is responsible here; what I learned when I was >>>>> tracking this down has mostly fell out of my brain, so take anything I >>>>> say with a large grain of salt. But I believe I tested with delayed_fput >>>>> completely disabled, and found another thing in io_uring with the same >>>>> effect as delayed_fput that wasn't being flushed. >>>> >>>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting >>>> io_uring can end up doing. But yes, delayed_fput() is another candidate. >>> >>> Sorry - was just working through my recollections/initial thought >>> process out loud >> >> No worries, it might actually be a combination and this is why my >> io_uring side patch didn't fully resolve it. Wrote a simple reproducer >> and it seems to reliably trigger it, but is fixed with an flush of the >> delayed fput list on mount -EBUSY return. Still digging... >> >>>>>> For io_uring specifically, it may make sense to wait on the fallback >>>>>> work. The below patch does this, and should fix the issue. But I'm not >>>>>> fully convinced that this is really needed, as I do think this can >>>>>> happen without io_uring as well. It just doesn't right now as the test >>>>>> does buffered IO, and aio will be fully sync with buffered IO. That >>>>>> means there's either no gap where aio will hit it without O_DIRECT, or >>>>>> it's just small enough that it hasn't been hit. >>>>> >>>>> I just tried your patch and I still have generic/388 failing - it >>>>> might've taken a bit longer to pop this time. >>>> >>>> Yep see the same here. Didn't have time to look into it after sending >>>> that email today, just took a quick stab at writing a reproducer and >>>> ended up crashing bcachefs: >>> >>> You must have hit an error before we finished initializing the >>> filesystem, the list head never got initialized. Patch for that will be >>> in the testing branch momentarily. >> >> I'll pull that in. 
In testing just now, I hit a few more leaks: >> >> unreferenced object 0xffff0000e55cf200 (size 128): >> comm "mount", pid 723, jiffies 4294899134 (age 85.868s) >> hex dump (first 32 bytes): >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ >> backtrace: >> [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc >> [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 >> [<00000000cde48528>] __kmalloc+0xac/0xd4 >> [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20 >> [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc > > Can you faddr2line this? I just did a bunch of kmemleak testing and > didn't see it. 0xffff800008589a20 is in bch2_fs_alloc (fs/bcachefs/super.c:813). 808 !(c->online_reserved = alloc_percpu(u64)) || 809 !(c->btree_paths_bufs = alloc_percpu(struct btree_path_buf)) || 810 mempool_init_kvpmalloc_pool(&c->btree_bounce_pool, 1, 811 btree_bytes(c)) || 812 mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048) || 813 !(c->unused_inode_hints = kcalloc(1U << c->inode_shard_bits, 814 sizeof(u64), GFP_KERNEL))) { 815 ret = -BCH_ERR_ENOMEM_fs_other_alloc; 816 goto err; 817 }
On 6/28/23 2:44?PM, Jens Axboe wrote:
> On 6/28/23 11:52?AM, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
>>> I discussed this with Christian offline. I have a patch that is pretty
>>> simple, but it does mean that you'd wait for delayed fput flush off
>>> umount. Which seems kind of iffy.
>>>
>>> I think we need to back up a bit and consider if the kill && umount
>>> really is sane. If you kill a task that has open files, then any fput
>>> from that task will end up being delayed. This means that the umount may
>>> very well fail.
>>>
>>> It'd be handy if we could have umount wait for that to finish, but I'm
>>> not at all confident this is a sane solution for all cases. And as
>>> discussed, we have no way to even identify which files we'd need to
>>> flush out of the delayed list.
>>>
>>> Maybe the test case just needs fixing? Christian suggested lazy/detach
>>> umount and wait for sb release. There's an fsnotify hook for that,
>>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
>>> to me that this would be the way to make it more reliable when killing
>>> of tasks with open files are involved.
>>
>> No, this is a real breakage. Any time we introduce unexpected
>> asynchrony there's the potential for breakage: case in point, there was
>> a filesystem that made rm asynchronous, then there were scripts out
>> there that deleted until df showed under some threshold.. whoops...
>
> This is nothing new - any fput done from an exiting task will end up
> being deferred. The window may be a bit wider now or a bit different,
> but it's the same window. If an application assumes it can kill && wait
> on a task and be guaranteed that the files are released as soon as wait
> returns, it is mistaken. That is NOT the case.

Case in point, just changed my reproducer to use aio instead of
io_uring. Here's the full script:

#!/bin/bash

DEV=/dev/nvme1n1
MNT=/data
ITER=0

while true; do
    echo loop $ITER
    sudo mount $DEV $MNT
    fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
    Y=$(($RANDOM % 3))
    X=$(($RANDOM % 10))
    VAL="$Y.$X"
    sleep $VAL
    ps -e | grep fio > /dev/null 2>&1
    while [ $? -eq 0 ]; do
        killall -9 fio > /dev/null 2>&1
        echo will wait
        wait > /dev/null 2>&1
        echo done waiting
        ps -e | grep "fio " > /dev/null 2>&1
    done
    sudo umount /data
    if [ $? -ne 0 ]; then
        break
    fi
    ((ITER++))
done

and if I run that, fails on the first umount attempt in that loop:

axboe@m1max-kvm ~> bash test2.sh
loop 0
will wait
done waiting
umount: /data: target is busy.

So yeah, this is _nothing_ new. I really don't think trying to address
this in the kernel is the right approach, it'd be a lot saner to harden
the xfstest side to deal with the umount a bit more sanely. There are
obviously tons of other ways that a mount could get pinned, which isn't
too relevant here since the bdev and mount point are basically exclusive
to the test being run. But the kill and delayed fput is enough to make
that case imho.
On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
> Case in point, just changed my reproducer to use aio instead of
> io_uring. Here's the full script:
>
> #!/bin/bash
>
> DEV=/dev/nvme1n1
> MNT=/data
> ITER=0
>
> while true; do
>     echo loop $ITER
>     sudo mount $DEV $MNT
>     fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
>     Y=$(($RANDOM % 3))
>     X=$(($RANDOM % 10))
>     VAL="$Y.$X"
>     sleep $VAL
>     ps -e | grep fio > /dev/null 2>&1
>     while [ $? -eq 0 ]; do
>         killall -9 fio > /dev/null 2>&1
>         echo will wait
>         wait > /dev/null 2>&1
>         echo done waiting
>         ps -e | grep "fio " > /dev/null 2>&1
>     done
>     sudo umount /data
>     if [ $? -ne 0 ]; then
>         break
>     fi
>     ((ITER++))
> done
>
> and if I run that, fails on the first umount attempt in that loop:
>
> axboe@m1max-kvm ~> bash test2.sh
> loop 0
> will wait
> done waiting
> umount: /data: target is busy.
>
> So yeah, this is _nothing_ new. I really don't think trying to address
> this in the kernel is the right approach, it'd be a lot saner to harden
> the xfstest side to deal with the umount a bit more sanely. There are
> obviously tons of other ways that a mount could get pinned, which isn't
> too relevant here since the bdev and mount point are basically exclusive
> to the test being run. But the kill and delayed fput is enough to make
> that case imho.

Uh, count me very much not in favor of hacking around bugs elsewhere.

Al, do you know if this has been considered before? We've got fput()
being called from aio completion, which often runs out of a workqueue
(if not a workqueue, a bottom half of some sort - what happens then, I
wonder) - so the effect is that it goes on the global list, not the
task work list.

Hence, kill -9ing a process doing aio (or io_uring io, for extra
reasons) causes umount to fail with -EBUSY. And since there's no
mechanism for userspace to deal with this besides sleep and retry, this
seems pretty gross.

I'd be willing to tackle this for aio since I know that code...
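(The "sleep and retry" fallback mentioned above would look roughly like this from userspace; a sketch only, with /mnt, the retry count, and the delay chosen arbitrarily for illustration:)

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

int main(void)
{
	/* Retry the unmount while delayed fputs drain; give up after ~5s. */
	for (int i = 0; i < 50; i++) {
		if (umount("/mnt") == 0)
			return 0;
		if (errno != EBUSY)
			break;
		usleep(100 * 1000);	/* 100ms between attempts */
	}
	perror("umount");
	return 1;
}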
On 6/28/23 2:54?PM, Jens Axboe wrote: > On 6/28/23 11:54?AM, Kent Overstreet wrote: >> On Wed, Jun 28, 2023 at 08:58:45AM -0600, Jens Axboe wrote: >>> On 6/27/23 10:01?PM, Kent Overstreet wrote: >>>> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote: >>>>> On 6/27/23 2:15?PM, Kent Overstreet wrote: >>>>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept >>>>>>> failing because it assumed it was XFS. >>>>>>> >>>>>>> I suspected this was just a timing issue, and it looks like that's >>>>>>> exactly what it is. Looking at the test case, it'll randomly kill -9 >>>>>>> fsstress, and if that happens while we have io_uring IO pending, then we >>>>>>> process completions inline (for a PF_EXITING current). This means they >>>>>>> get pushed to fallback work, which runs out of line. If we hit that case >>>>>>> AND the timing is such that it hasn't been processed yet, we'll still be >>>>>>> holding a file reference under the mount point and umount will -EBUSY >>>>>>> fail. >>>>>>> >>>>>>> As far as I can tell, this can happen with aio as well, it's just harder >>>>>>> to hit. If the fput happens while the task is exiting, then fput will >>>>>>> end up being delayed through a workqueue as well. The test case assumes >>>>>>> that once it's reaped the exit of the killed task that all files are >>>>>>> released, which isn't necessarily true if they are done out-of-line. >>>>>> >>>>>> Yeah, I traced it through to the delayed fput code as well. >>>>>> >>>>>> I'm not sure delayed fput is responsible here; what I learned when I was >>>>>> tracking this down has mostly fell out of my brain, so take anything I >>>>>> say with a large grain of salt. But I believe I tested with delayed_fput >>>>>> completely disabled, and found another thing in io_uring with the same >>>>>> effect as delayed_fput that wasn't being flushed. >>>>> >>>>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting >>>>> io_uring can end up doing. But yes, delayed_fput() is another candidate. >>>> >>>> Sorry - was just working through my recollections/initial thought >>>> process out loud >>> >>> No worries, it might actually be a combination and this is why my >>> io_uring side patch didn't fully resolve it. Wrote a simple reproducer >>> and it seems to reliably trigger it, but is fixed with an flush of the >>> delayed fput list on mount -EBUSY return. Still digging... >>> >>>>>>> For io_uring specifically, it may make sense to wait on the fallback >>>>>>> work. The below patch does this, and should fix the issue. But I'm not >>>>>>> fully convinced that this is really needed, as I do think this can >>>>>>> happen without io_uring as well. It just doesn't right now as the test >>>>>>> does buffered IO, and aio will be fully sync with buffered IO. That >>>>>>> means there's either no gap where aio will hit it without O_DIRECT, or >>>>>>> it's just small enough that it hasn't been hit. >>>>>> >>>>>> I just tried your patch and I still have generic/388 failing - it >>>>>> might've taken a bit longer to pop this time. >>>>> >>>>> Yep see the same here. Didn't have time to look into it after sending >>>>> that email today, just took a quick stab at writing a reproducer and >>>>> ended up crashing bcachefs: >>>> >>>> You must have hit an error before we finished initializing the >>>> filesystem, the list head never got initialized. Patch for that will be >>>> in the testing branch momentarily. >>> >>> I'll pull that in. 
In testing just now, I hit a few more leaks: >>> >>> unreferenced object 0xffff0000e55cf200 (size 128): >>> comm "mount", pid 723, jiffies 4294899134 (age 85.868s) >>> hex dump (first 32 bytes): >>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ >>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ >>> backtrace: >>> [<000000001d69062c>] slab_post_alloc_hook.isra.0+0xb4/0xbc >>> [<00000000c503def2>] __kmem_cache_alloc_node+0xd0/0x178 >>> [<00000000cde48528>] __kmalloc+0xac/0xd4 >>> [<000000006cb9446a>] kmalloc_array.constprop.0+0x18/0x20 >>> [<000000008341b32c>] bch2_fs_alloc+0x73c/0xbcc >> >> Can you faddr2line this? I just did a bunch of kmemleak testing and >> didn't see it. > > 0xffff800008589a20 is in bch2_fs_alloc (fs/bcachefs/super.c:813). > 808 !(c->online_reserved = alloc_percpu(u64)) || > 809 !(c->btree_paths_bufs = alloc_percpu(struct btree_path_buf)) || > 810 mempool_init_kvpmalloc_pool(&c->btree_bounce_pool, 1, > 811 btree_bytes(c)) || > 812 mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048) || > 813 !(c->unused_inode_hints = kcalloc(1U << c->inode_shard_bits, > 814 sizeof(u64), GFP_KERNEL))) { > 815 ret = -BCH_ERR_ENOMEM_fs_other_alloc; > 816 goto err; > 817 } Got a whole bunch more running that aio reproducer I sent earlier. I'm sure a lot of these are dupes, sending them here for completeness. [ 677.739815] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) [ 1283.963249] kmemleak: 37 new suspected memory leaks (see /sys/kernel/debug/kmemleak) unreferenced object 0xffff0000e35de000 (size 8192): comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 1d 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 [<0000000078a13296>] krealloc+0x7c/0xc4 [<00000000f1fea4ad>] bch2_sb_realloc+0x12c/0x150 [<00000000f03d5ce6>] __copy_super+0x104/0x17c [<000000005567521f>] bch2_sb_to_fs+0x3c/0x80 [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff00020a209900 (size 128): comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s) hex dump (first 32 bytes): 03 01 01 00 02 01 01 00 04 01 01 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<00000000fcf82258>] kmalloc_array.constprop.0+0x18/0x20 [<00000000182c3be4>] __bch2_sb_replicas_v0_to_cpu_replicas+0x50/0x118 [<0000000012583a94>] bch2_sb_replicas_to_cpu_replicas+0xb0/0xc0 [<00000000fcd0b373>] bch2_sb_to_fs+0x4c/0x80 [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000206785400 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.068s) hex dump (first 32 bytes): 00 00 d9 20 02 00 ff ff 01 00 00 00 01 04 00 00 ... ............ 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<00000000bb95f8a0>] bch2_fs_alloc+0x690/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c unreferenced object 0xffff000206785700 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) hex dump (first 32 bytes): 00 00 96 2d 02 00 ff ff 01 00 00 00 01 04 00 00 ...-............ 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<0000000089ab54c3>] bch2_fs_replicas_init+0x64/0xac [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000206785600 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) hex dump (first 32 bytes): 00 1a 05 00 00 00 00 00 00 0c 02 00 00 00 00 00 ................ 42 9c ba 00 00 00 00 00 00 00 00 00 00 00 00 00 B............... 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<00000000f949dcc7>] replicas_table_update+0x84/0x214 [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c unreferenced object 0xffff000206785580 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 01 00 00 00 01 04 00 00 ................ 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<00000000639b7f33>] replicas_table_update+0x98/0x214 [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c unreferenced object 0xffff000206785080 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<000000001335974a>] __prealloc_shrinker+0x3c/0x60 [<0000000017b0bc26>] register_shrinker+0x14/0x34 [<00000000c07d01d7>] bch2_fs_btree_cache_init+0xf8/0x150 [<000000004b948640>] bch2_fs_alloc+0x7ac/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000200f2ec00 (size 1024): comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 a8 66 18 09 00 00 00 00 @........f...... 10 ec f2 00 02 00 ff ff 10 ec f2 00 02 00 ff ff ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000066405974>] kvmalloc_node+0x54/0xe4 [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120 [<0000000000df2e94>] rhashtable_init+0x148/0x1ac [<0000000080f397f7>] bch2_fs_btree_key_cache_init+0x48/0x90 [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000206785b80 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<000000001335974a>] __prealloc_shrinker+0x3c/0x60 [<0000000017b0bc26>] register_shrinker+0x14/0x34 [<00000000228dd43a>] bch2_fs_btree_key_cache_init+0x88/0x90 [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000206785500 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) hex dump (first 32 bytes): 00 00 20 2b 02 00 ff ff 01 00 00 00 01 04 00 00 .. +............ 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<00000000fc134979>] bch2_fs_btree_iter_init+0x98/0x130 [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000206785480 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) hex dump (first 32 bytes): 00 00 97 05 02 00 ff ff 01 00 00 00 01 04 00 00 ................ 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000004d03e2b7>] bch2_fs_btree_iter_init+0xb8/0x130 [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000230a31a00 (size 512): comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000009502ae7b>] kmalloc_trace+0x38/0x78 [<0000000060cbc45a>] init_srcu_struct_fields+0x38/0x284 [<00000000643a7c95>] init_srcu_struct+0x10/0x18 [<00000000c46c2041>] bch2_fs_btree_iter_init+0xc8/0x130 [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000222f14f00 (size 256): comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) hex dump (first 32 bytes): 03 00 00 00 01 00 ff ff 1a cf e0 f2 17 b2 a8 24 ...............$ cf 4a ba c3 fb 05 19 cd f6 4d f5 45 e7 e8 29 eb .J.......M.E..). backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000066405974>] kvmalloc_node+0x54/0xe4 [<00000000c83b22ef>] bch2_fs_buckets_waiting_for_journal_init+0x44/0x6c [<0000000026230712>] bch2_fs_alloc+0x7f0/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c unreferenced object 0xffff000230e00000 (size 720896): comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) hex dump (first 32 bytes): 40 71 26 38 00 00 00 00 00 00 00 00 00 00 00 00 @q&8............ d2 17 f9 2f 75 7e 51 2a 01 01 00 00 14 00 00 00 .../u~Q*........ 
backtrace: [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164 [<000000009024f86b>] __kmalloc_node+0x34/0xd4 [<0000000066405974>] kvmalloc_node+0x54/0xe4 [<00000000729eb36b>] bch2_fs_btree_write_buffer_init+0x58/0xb4 [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134 unreferenced object 0xffff000230900000 (size 720896): comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) hex dump (first 32 bytes): 88 96 28 f7 00 00 00 00 00 00 00 00 00 00 00 00 ..(............. d2 17 f9 2f 75 7e 51 2a 01 01 00 00 13 00 00 00 .../u~Q*........ backtrace: [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164 [<000000009024f86b>] __kmalloc_node+0x34/0xd4 [<0000000066405974>] kvmalloc_node+0x54/0xe4 [<00000000f27707f5>] bch2_fs_btree_write_buffer_init+0x7c/0xb4 [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134 unreferenced object 0xffff0000c8d1e300 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) hex dump (first 32 bytes): 00 c0 0a 02 02 00 ff ff 00 80 5a 20 00 80 ff ff ..........Z .... 00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00 .P.............. backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff0002020ac000 (size 448): comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 c8 02 00 02 02 00 ff ff ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff0000c8d1e500 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) hex dump (first 32 bytes): 00 40 e4 c9 00 00 ff ff c0 d9 9a 03 00 fc ff ff .@.............. 80 d9 9a 03 00 fc ff ff 40 d9 9a 03 00 fc ff ff ........@....... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000002f5588b4>] biovec_init_pool+0x24/0x2c [<00000000a2b87494>] bioset_init+0x208/0x22c [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 unreferenced object 0xffff0000c8d1e900 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) hex dump (first 32 bytes): 00 c2 0a 02 02 00 ff ff 00 00 58 20 00 80 ff ff ..........X .... 00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00 .P.............. backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff0002020ac200 (size 448): comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 0d 69 37 bf .............i7. 00 00 00 00 00 00 00 00 c8 c2 0a 02 02 00 ff ff ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff0000c8d1e980 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) hex dump (first 32 bytes): 00 50 e4 c9 00 00 ff ff c0 dc 9a 03 00 fc ff ff .P.............. 80 dc 9a 03 00 fc ff ff 40 44 23 03 00 fc ff ff ........@D#..... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000002f5588b4>] biovec_init_pool+0x24/0x2c [<00000000a2b87494>] bioset_init+0x208/0x22c [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 unreferenced object 0xffff000230a31e00 (size 512): comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) hex dump (first 32 bytes): 00 9f 39 08 00 fc ff ff 40 f9 99 08 00 fc ff ff ..9.....@....... 40 c0 8b 08 00 fc ff ff 00 d8 14 08 00 fc ff ff @............... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000009f58f780>] bch2_fs_io_init+0x9c/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 unreferenced object 0xffff000200f2e800 (size 1024): comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 89 16 1e cd 00 00 00 00 @............... 10 e8 f2 00 02 00 ff ff 10 e8 f2 00 02 00 ff ff ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000066405974>] kvmalloc_node+0x54/0xe4 [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120 [<0000000000df2e94>] rhashtable_init+0x148/0x1ac [<00000000347789c6>] bch2_fs_io_init+0xb8/0x124 [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000222f14700 (size 256): comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) hex dump (first 32 bytes): 68 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 h............... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<000000003a6af69a>] crypto_alloc_tfmmem+0x3c/0x70 [<000000006c0841c0>] crypto_create_tfm_node+0x20/0xa0 [<00000000b0aa6a0f>] crypto_alloc_tfm_node+0x94/0xac [<00000000a2421d04>] crypto_alloc_shash+0x20/0x28 [<00000000aeafee8e>] bch2_fs_encryption_init+0x64/0x150 [<0000000002e060b3>] bch2_fs_alloc+0x840/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 unreferenced object 0xffff00020e544100 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) hex dump (first 32 bytes): 00 00 f0 20 02 00 ff ff 80 04 f0 20 02 00 ff ff ... ....... .... 00 09 f0 20 02 00 ff ff 80 0d f0 20 02 00 ff ff ... ....... .... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000220f00000 (size 1104): comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 98 e7 87 30 02 00 ff ff ...........0.... 22 01 00 00 00 00 ad de 18 00 f0 20 02 00 ff ff ".......... .... 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000220f00480 (size 1104): comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) hex dump (first 32 bytes): 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 00 00 00 00 00 00 00 00 00 00 9d a2 98 2c ..............., backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000220f00900 (size 1104): comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) hex dump (first 32 bytes): 22 01 00 00 00 00 ad de 08 09 f0 20 02 00 ff ff ".......... .... 08 09 f0 20 02 00 ff ff b9 17 f0 20 02 00 ff ff ... ....... .... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff000220f00d80 (size 1104): comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 34 43 3f b1 ............4C?. 
00 00 00 00 00 00 00 00 24 00 bb 04 02 28 1e 3b ........$....(.; backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000001719fe70>] bioset_init+0x188/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 unreferenced object 0xffff00020e544000 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) hex dump (first 32 bytes): 00 00 b5 05 02 00 ff ff 00 50 b5 05 02 00 ff ff .........P...... 00 10 b5 05 02 00 ff ff 00 b0 18 09 02 00 ff ff ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000007b360995>] __kmalloc_node+0xac/0xd4 [<0000000050ae8904>] mempool_init_node+0x64/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000002f5588b4>] biovec_init_pool+0x24/0x2c [<00000000a2b87494>] bioset_init+0x208/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 unreferenced object 0xffff00020918b000 (size 4096): comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) hex dump (first 32 bytes): c0 5b 26 03 00 fc ff ff 00 10 00 00 00 00 00 00 .[&............. 22 01 00 00 00 00 ad de 18 b0 18 09 02 00 ff ff "............... backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c [<000000002d6118f3>] mempool_init_node+0x94/0xd8 [<00000000e714c59a>] mempool_init+0x14/0x1c [<000000002f5588b4>] biovec_init_pool+0x24/0x2c [<00000000a2b87494>] bioset_init+0x208/0x22c [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 unreferenced object 0xffff000200f2f400 (size 1024): comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) hex dump (first 32 bytes): 19 00 00 00 00 00 00 00 9d 19 00 00 00 00 00 00 ................ a6 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<0000000097d0e280>] bch2_blacklist_table_initialize+0x48/0xc4 [<000000007af2f7c0>] bch2_fs_recovery+0x220/0x140c [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 [<00000000b4ee996a>] el0_svc+0x74/0x9c unreferenced object 0xffff00020e544800 (size 128): comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) hex dump (first 32 bytes): 07 00 00 01 09 00 00 00 73 74 61 72 74 69 6e 67 ........starting 20 6a 6f 75 72 6e 61 6c 20 61 74 20 65 6e 74 72 journal at entr backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 [<0000000078a13296>] krealloc+0x7c/0xc4 [<00000000224f82f4>] __darray_make_room.constprop.0+0x5c/0x7c [<00000000caa2f6f2>] __bch2_trans_log_msg+0x80/0x12c [<0000000034a8dfea>] __bch2_fs_log_msg+0x68/0x158 [<00000000cc0719ad>] bch2_journal_log_msg+0x60/0x98 [<00000000a0b3d87b>] bch2_fs_recovery+0x8f0/0x140c [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430 [<00000000e72d508e>] bch2_mount+0x194/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 unreferenced object 0xffff00020f2d8398 (size 184): comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s) hex dump (first 32 bytes): 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0 [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 [<00000000527b4561>] path_mount+0x5d0/0x6c8 [<00000000dc643d96>] do_mount+0x80/0xa4 unreferenced object 0xffff000222f15e00 (size 256): comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s) hex dump (first 32 bytes): 12 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 ff ff ff ff 00 10 00 00 00 00 00 00 ................ 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338 [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0 [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 unreferenced object 0xffff00020ab0da80 (size 128): comm "fio", pid 3068, jiffies 4294924399 (age 3938.112s) hex dump (first 32 bytes): 70 61 74 68 3a 20 69 64 78 20 20 30 20 72 65 66 path: idx 0 ref 20 30 3a 30 20 50 20 53 20 62 74 72 65 65 3d 73 0:0 P S btree=s backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 [<0000000078a13296>] krealloc+0x7c/0xc4 [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c [<00000000e73dab89>] bch2_prt_printf+0xac/0x104 [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8 [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64 [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18 [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104 [<00000000715f184d>] btree_path_alloc+0x44/0x140 [<0000000028aac82e>] bch2_path_get+0x190/0x210 [<000000001fbd1416>] bch2_trans_iter_init_outlined+0xd4/0x100 [<00000000b7c2c8e8>] bch2_trans_iter_init.constprop.0+0x28/0x30 [<000000005ee45b0d>] __bch2_dirent_lookup_trans+0xc4/0x20c [<00000000bf9849b2>] bch2_dirent_lookup+0x9c/0x10c unreferenced object 0xffff00020f2d8450 (size 184): comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s) hex dump (first 32 bytes): 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0 [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 [<00000000225a6085>] bch2_lookup+0x7c/0xb8 [<0000000059304a98>] __lookup_slow+0xd4/0x114 [<000000001225c82d>] walk_component+0x98/0xd4 [<0000000095114e46>] path_lookupat+0x84/0x114 [<000000002ee74fa2>] filename_lookup+0x54/0xc4 unreferenced object 0xffff000222f15a00 (size 256): comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s) hex dump (first 32 bytes): 13 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 ff ff ff ff f3 15 00 30 00 00 00 00 ...........0.... 
backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<00000000cd9515c0>] __kmalloc+0xac/0xd4 [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338 [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0 [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 [<00000000225a6085>] bch2_lookup+0x7c/0xb8 [<0000000059304a98>] __lookup_slow+0xd4/0x114 [<000000001225c82d>] walk_component+0x98/0xd4 unreferenced object 0xffff0002058e0e00 (size 256): comm "fio", pid 3081, jiffies 4294924461 (age 3937.868s) hex dump (first 32 bytes): 70 61 74 68 3a 20 69 64 78 20 20 31 20 72 65 66 path: idx 1 ref 20 30 3a 30 20 50 20 20 20 62 74 72 65 65 3d 65 0:0 P btree=e backtrace: [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 [<0000000078a13296>] krealloc+0x7c/0xc4 [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c [<00000000e73dab89>] bch2_prt_printf+0xac/0x104 [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8 [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64 [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18 [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104 [<00000000af279ad9>] __bch2_btree_path_make_mut+0x64/0x1d0 [<00000000b6ea382b>] __bch2_btree_path_set_pos+0x5c/0x1f4 [<000000001f2292b9>] bch2_btree_path_set_pos+0x68/0x78 [<0000000046402275>] bch2_btree_iter_peek_slot+0xd0/0x3b0 [<00000000a578d851>] bchfs_read.isra.0+0x128/0x77c [<00000000d544d588>] bch2_readahead+0x1a0/0x264
On 6/28/23 4:13 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
>> Case in point, just changed my reproducer to use aio instead of
>> io_uring. Here's the full script:
>>
>> #!/bin/bash
>>
>> DEV=/dev/nvme1n1
>> MNT=/data
>> ITER=0
>>
>> while true; do
>> echo loop $ITER
>> sudo mount $DEV $MNT
>> fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
>> Y=$(($RANDOM % 3))
>> X=$(($RANDOM % 10))
>> VAL="$Y.$X"
>> sleep $VAL
>> ps -e | grep fio > /dev/null 2>&1
>> while [ $? -eq 0 ]; do
>> killall -9 fio > /dev/null 2>&1
>> echo will wait
>> wait > /dev/null 2>&1
>> echo done waiting
>> ps -e | grep "fio " > /dev/null 2>&1
>> done
>> sudo umount /data
>> if [ $? -ne 0 ]; then
>> break
>> fi
>> ((ITER++))
>> done
>>
>> and if I run that, fails on the first umount attempt in that loop:
>>
>> axboe@m1max-kvm ~> bash test2.sh
>> loop 0
>> will wait
>> done waiting
>> umount: /data: target is busy.
>>
>> So yeah, this is _nothing_ new. I really don't think trying to address
>> this in the kernel is the right approach, it'd be a lot saner to harden
>> the xfstest side to deal with the umount a bit more sanely. There are
>> obviously tons of other ways that a mount could get pinned, which isn't
>> too relevant here since the bdev and mount point are basically exclusive
>> to the test being run. But the kill and delayed fput is enough to make
>> that case imho.
>
> Uh, count me very much not in favor of hacking around bugs elsewhere.
>
> Al, do you know if this has been considered before? We've got fput()
> being called from aio completion, which often runs out of a workqueue (if
> not a workqueue, a bottom half of some sort - what happens then, I
> wonder) - so the effect is that it goes on the global list, not the task
> work list.
>
> hence, kill -9ing a process doing aio (or io_uring io, for extra
> reasons) causes umount to fail with -EBUSY.
>
> and since there's no mechanism for userspace to deal with this besides
> sleep and retry, this seems pretty gross.

But there is, as Christian outlined. I would not call it pretty or
intuitive, but you can in fact make it work just fine and not just for
the deferred fput() case but also in the presence of other kinds of
pins. Of which there are of course many.

> I'd be willing to tackle this for aio since I know that code...

But it's not aio (or io_uring or whatever), it's simply the fact that
doing an fput() from an exiting task (for example) will end up being
done async. And hence waiting for task exits is NOT enough to ensure
that all file references have been released.

Since there are a variety of other reasons why a mount may be pinned and
fail to umount, perhaps it's worth considering that changing this
behavior won't buy us that much. Especially since it's been around for
more than 10 years:

commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sun Jun 24 09:56:45 2012 +0400

    switch fput to task_work_add

though that commit message goes on to read:

    We are guaranteed that __fput() will be done before we return
    to userland (or exit). Note that for fput() from a kernel
    thread we get an async behaviour; it's almost always OK, but
    sometimes you might need to have __fput() completed before
    you do anything else. There are two mechanisms for that -
    a general barrier (flush_delayed_fput()) and explicit
    __fput_sync(). Both should be used with care (as was the
    case for fput() from kernel threads all along). See comments
    in fs/file_table.c for details.

where that first sentence isn't true if the task is indeed exiting. I
guess you can say that it is, as it doesn't return to userland, but
that's splitting hairs. The commit in question doesn't seem to handle
that case either; I'm assuming that came in with a later fixup.

It is true if the task_work gets added, as that will get run before
returning to userspace.

If a case were to be made that we also guarantee that fput has been done
by the time the task returns to userspace, or exits, then we'd probably
want to move that deferred fput list to the task_struct and ensure that
it gets run if the task exits rather than have a global deferred list.
Currently we have:

1) If kthread or in interrupt
1a) add to global fput list
2) task_work_add if not. If that fails, goto 1a.

which would then become:

1) If kthread or in interrupt
1a) add to global fput list
2) task_work_add if not. If that fails, we know the task is exiting; add
   to a per-task defer list to be run at a convenient time before the task
   has exited.

and that seems a lot saner than hacking around this in umount
specifically.
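To make the before/after concrete, here is a rough C sketch of that dispatch -
not the actual fs/file_table.c code. queue_global_fput(), queue_task_fput() and
fput_work() are made-up helpers standing in for "add to the global delayed-fput
list", "add to the proposed per-task defer list" and "the callback_head used
for the task_work":

static void fput_deferred(struct file *file)
{
	struct task_struct *task = current;

	/* 1) kthread or interrupt context: no user task to hang the work off */
	if (in_interrupt() || (task->flags & PF_KTHREAD)) {
		/* 1a) global fput list, flushed asynchronously */
		queue_global_fput(file);
		return;
	}

	/* 2) normal case: __fput() runs as task_work before the task
	 * returns to userspace */
	if (!task_work_add(task, fput_work(file), TWA_RESUME))
		return;

	/*
	 * task_work_add() only fails once the task is already exiting.
	 * Today that falls back to the global list (goto 1a); the proposal
	 * above is to put it on a per-task defer list instead, flushed
	 * before the task is gone, so waiting for the task to exit also
	 * waits for its fputs.
	 */
	queue_task_fput(task, file);
}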
On Wed, Jun 28, 2023 at 04:33:55PM -0600, Jens Axboe wrote:
> On 6/28/23 4:13 PM, Kent Overstreet wrote:
> > On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
> >> Case in point, just changed my reproducer to use aio instead of
> >> io_uring. Here's the full script:
> >>
> >> #!/bin/bash
> >>
> >> DEV=/dev/nvme1n1
> >> MNT=/data
> >> ITER=0
> >>
> >> while true; do
> >> echo loop $ITER
> >> sudo mount $DEV $MNT
> >> fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null &
> >> Y=$(($RANDOM % 3))
> >> X=$(($RANDOM % 10))
> >> VAL="$Y.$X"
> >> sleep $VAL
> >> ps -e | grep fio > /dev/null 2>&1
> >> while [ $? -eq 0 ]; do
> >> killall -9 fio > /dev/null 2>&1
> >> echo will wait
> >> wait > /dev/null 2>&1
> >> echo done waiting
> >> ps -e | grep "fio " > /dev/null 2>&1
> >> done
> >> sudo umount /data
> >> if [ $? -ne 0 ]; then
> >> break
> >> fi
> >> ((ITER++))
> >> done
> >>
> >> and if I run that, fails on the first umount attempt in that loop:
> >>
> >> axboe@m1max-kvm ~> bash test2.sh
> >> loop 0
> >> will wait
> >> done waiting
> >> umount: /data: target is busy.
> >>
> >> So yeah, this is _nothing_ new. I really don't think trying to address
> >> this in the kernel is the right approach, it'd be a lot saner to harden
> >> the xfstest side to deal with the umount a bit more sanely. There are
> >> obviously tons of other ways that a mount could get pinned, which isn't
> >> too relevant here since the bdev and mount point are basically exclusive
> >> to the test being run. But the kill and delayed fput is enough to make
> >> that case imho.
> >
> > Uh, count me very much not in favor of hacking around bugs elsewhere.
> >
> > Al, do you know if this has been considered before? We've got fput()
> > being called from aio completion, which often runs out of a workqueue (if
> > not a workqueue, a bottom half of some sort - what happens then, I
> > wonder) - so the effect is that it goes on the global list, not the task
> > work list.
> >
> > hence, kill -9ing a process doing aio (or io_uring io, for extra
> > reasons) causes umount to fail with -EBUSY.
> >
> > and since there's no mechanism for userspace to deal with this besides
> > sleep and retry, this seems pretty gross.
>
> But there is, as Christian outlined. I would not call it pretty or
> intuitive, but you can in fact make it work just fine and not just for
> the deferred fput() case but also in the presence of other kinds of
> pins. Of which there are of course many.

No, because as I explained, that just defers the race until you next try
to use the device, since with lazy umount the device will still be in
use when umount returns. What you'd want is a lazy, synchronous umount,
and AFAIK that doesn't exist.

> > I'd be willing to tackle this for aio since I know that code...
>
> But it's not aio (or io_uring or whatever), it's simply the fact that
> doing an fput() from an exiting task (for example) will end up being
> done async. And hence waiting for task exits is NOT enough to ensure
> that all file references have been released.
>
> Since there are a variety of other reasons why a mount may be pinned and
> fail to umount, perhaps it's worth considering that changing this
> behavior won't buy us that much. Especially since it's been around for
> more than 10 years:

Because it seems that before io_uring the race was quite a bit harder to
hit - I only started seeing it when things started switching over to
io_uring. generic/388 used to pass reliably for me (pre backpointers);
now it doesn't.

> commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Sun Jun 24 09:56:45 2012 +0400
>
>     switch fput to task_work_add
>
> though that commit message goes on to read:
>
>     We are guaranteed that __fput() will be done before we return
>     to userland (or exit). Note that for fput() from a kernel
>     thread we get an async behaviour; it's almost always OK, but
>     sometimes you might need to have __fput() completed before
>     you do anything else. There are two mechanisms for that -
>     a general barrier (flush_delayed_fput()) and explicit
>     __fput_sync(). Both should be used with care (as was the
>     case for fput() from kernel threads all along). See comments
>     in fs/file_table.c for details.
>
> where that first sentence isn't true if the task is indeed exiting. I
> guess you can say that it is, as it doesn't return to userland, but
> that's splitting hairs. The commit in question doesn't seem to handle
> that case either; I'm assuming that came in with a later fixup.
>
> It is true if the task_work gets added, as that will get run before
> returning to userspace.

Yes, AIO seems to very much be the exceptional case that wasn't
originally considered.

> If a case were to be made that we also guarantee that fput has been done
> by the time the task returns to userspace, or exits,

And that does seem to be the intent of the original code, no?

> then we'd probably want to move that deferred fput list to the
> task_struct and ensure that it gets run if the task exits rather than
> have a global deferred list. Currently we have:
>
> 1) If kthread or in interrupt
> 1a) add to global fput list
> 2) task_work_add if not. If that fails, goto 1a.
>
> which would then become:
>
> 1) If kthread or in interrupt
> 1a) add to global fput list
> 2) task_work_add if not. If that fails, we know the task is exiting; add
>    to a per-task defer list to be run at a convenient time before the task
>    has exited.
>
> and that seems a lot saner than hacking around this in umount
> specifically.

No, it becomes: if we're running in a user task, or if we're doing an
operation on behalf of a user task, add to the user task's deferred
list; otherwise add to the global deferred list.
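A rough sketch of that variant, using the same made-up queue_*() helpers as in
the earlier sketch; the submitter argument is the point of difference - it
would have to be carried with the request by aio/io_uring so that a completion
running from a workqueue or other kernel context can still find the user task
the I/O is being done on behalf of:

static void fput_on_behalf_of(struct file *file, struct task_struct *submitter)
{
	struct task_struct *task = submitter;

	/* no explicit submitter: fall back to current, if it's a user task */
	if (!task && !in_interrupt() && !(current->flags & PF_KTHREAD))
		task = current;

	if (task)
		queue_task_fput(task, file);	/* that task's deferred list */
	else
		queue_global_fput(file);	/* nothing to attribute it to */
}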
On Wed, Jun 28, 2023 at 04:14:44PM -0600, Jens Axboe wrote: > Got a whole bunch more running that aio reproducer I sent earlier. I'm > sure a lot of these are dupes, sending them here for completeness. Are you running 'echo scan > /sys/kernel/debug/kmemleak' while the test is running? I see a lot of spurious leaks when I do that that go away if I scan after everything's shut down. > > [ 677.739815] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > [ 1283.963249] kmemleak: 37 new suspected memory leaks (see /sys/kernel/debug/kmemleak) > > unreferenced object 0xffff0000e35de000 (size 8192): > comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 1d 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 > [<0000000078a13296>] krealloc+0x7c/0xc4 > [<00000000f1fea4ad>] bch2_sb_realloc+0x12c/0x150 > [<00000000f03d5ce6>] __copy_super+0x104/0x17c > [<000000005567521f>] bch2_sb_to_fs+0x3c/0x80 > [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff00020a209900 (size 128): > comm "mount", pid 3049, jiffies 4294924385 (age 3938.092s) > hex dump (first 32 bytes): > 03 01 01 00 02 01 01 00 04 01 01 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<00000000fcf82258>] kmalloc_array.constprop.0+0x18/0x20 > [<00000000182c3be4>] __bch2_sb_replicas_v0_to_cpu_replicas+0x50/0x118 > [<0000000012583a94>] bch2_sb_replicas_to_cpu_replicas+0xb0/0xc0 > [<00000000fcd0b373>] bch2_sb_to_fs+0x4c/0x80 > [<0000000062d4e9f6>] bch2_fs_alloc+0x410/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000206785400 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.068s) > hex dump (first 32 bytes): > 00 00 d9 20 02 00 ff ff 01 00 00 00 01 04 00 00 ... ............ > 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<00000000bb95f8a0>] bch2_fs_alloc+0x690/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > unreferenced object 0xffff000206785700 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) > hex dump (first 32 bytes): > 00 00 96 2d 02 00 ff ff 01 00 00 00 01 04 00 00 ...-............ > 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<0000000089ab54c3>] bch2_fs_replicas_init+0x64/0xac > [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000206785600 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) > hex dump (first 32 bytes): > 00 1a 05 00 00 00 00 00 00 0c 02 00 00 00 00 00 ................ > 42 9c ba 00 00 00 00 00 00 00 00 00 00 00 00 00 B............... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<00000000f949dcc7>] replicas_table_update+0x84/0x214 > [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac > [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > unreferenced object 0xffff000206785580 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.076s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 01 00 00 00 01 04 00 00 ................ > 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<00000000639b7f33>] replicas_table_update+0x98/0x214 > [<000000002debc89d>] bch2_fs_replicas_init+0x74/0xac > [<0000000056c4a5fe>] bch2_fs_alloc+0x79c/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > unreferenced object 0xffff000206785080 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<000000001335974a>] __prealloc_shrinker+0x3c/0x60 > [<0000000017b0bc26>] register_shrinker+0x14/0x34 > [<00000000c07d01d7>] bch2_fs_btree_cache_init+0xf8/0x150 > [<000000004b948640>] bch2_fs_alloc+0x7ac/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000200f2ec00 (size 1024): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) > hex dump (first 32 bytes): > 40 00 00 00 00 00 00 00 a8 66 18 09 00 00 00 00 @........f...... > 10 ec f2 00 02 00 ff ff 10 ec f2 00 02 00 ff ff ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000066405974>] kvmalloc_node+0x54/0xe4 > [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120 > [<0000000000df2e94>] rhashtable_init+0x148/0x1ac > [<0000000080f397f7>] bch2_fs_btree_key_cache_init+0x48/0x90 > [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000206785b80 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.088s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<000000001335974a>] __prealloc_shrinker+0x3c/0x60 > [<0000000017b0bc26>] register_shrinker+0x14/0x34 > [<00000000228dd43a>] bch2_fs_btree_key_cache_init+0x88/0x90 > [<0000000089e6783c>] bch2_fs_alloc+0x7c0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000206785500 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) > hex dump (first 32 bytes): > 00 00 20 2b 02 00 ff ff 01 00 00 00 01 04 00 00 .. +............ > 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<00000000fc134979>] bch2_fs_btree_iter_init+0x98/0x130 > [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000206785480 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) > hex dump (first 32 bytes): > 00 00 97 05 02 00 ff ff 01 00 00 00 01 04 00 00 ................ > 01 04 00 00 01 04 00 00 01 04 00 00 01 04 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000004d03e2b7>] bch2_fs_btree_iter_init+0xb8/0x130 > [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000230a31a00 (size 512): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.096s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000009502ae7b>] kmalloc_trace+0x38/0x78 > [<0000000060cbc45a>] init_srcu_struct_fields+0x38/0x284 > [<00000000643a7c95>] init_srcu_struct+0x10/0x18 > [<00000000c46c2041>] bch2_fs_btree_iter_init+0xc8/0x130 > [<00000000addf57f5>] bch2_fs_alloc+0x7d0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000222f14f00 (size 256): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) > hex dump (first 32 bytes): > 03 00 00 00 01 00 ff ff 1a cf e0 f2 17 b2 a8 24 ...............$ > cf 4a ba c3 fb 05 19 cd f6 4d f5 45 e7 e8 29 eb .J.......M.E..). > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000066405974>] kvmalloc_node+0x54/0xe4 > [<00000000c83b22ef>] bch2_fs_buckets_waiting_for_journal_init+0x44/0x6c > [<0000000026230712>] bch2_fs_alloc+0x7f0/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > unreferenced object 0xffff000230e00000 (size 720896): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) > hex dump (first 32 bytes): > 40 71 26 38 00 00 00 00 00 00 00 00 00 00 00 00 @q&8............ > d2 17 f9 2f 75 7e 51 2a 01 01 00 00 14 00 00 00 .../u~Q*........ > backtrace: > [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164 > [<000000009024f86b>] __kmalloc_node+0x34/0xd4 > [<0000000066405974>] kvmalloc_node+0x54/0xe4 > [<00000000729eb36b>] bch2_fs_btree_write_buffer_init+0x58/0xb4 > [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134 > unreferenced object 0xffff000230900000 (size 720896): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.100s) > hex dump (first 32 bytes): > 88 96 28 f7 00 00 00 00 00 00 00 00 00 00 00 00 ..(............. > d2 17 f9 2f 75 7e 51 2a 01 01 00 00 13 00 00 00 .../u~Q*........ 
> backtrace: > [<00000000c6d9e620>] __kmalloc_large_node+0x134/0x164 > [<000000009024f86b>] __kmalloc_node+0x34/0xd4 > [<0000000066405974>] kvmalloc_node+0x54/0xe4 > [<00000000f27707f5>] bch2_fs_btree_write_buffer_init+0x7c/0xb4 > [<000000003e35ba10>] bch2_fs_alloc+0x800/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > [<00000000a22b66b5>] el0t_64_sync_handler+0xa8/0x134 > unreferenced object 0xffff0000c8d1e300 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) > hex dump (first 32 bytes): > 00 c0 0a 02 02 00 ff ff 00 80 5a 20 00 80 ff ff ..........Z .... > 00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00 .P.............. > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff0002020ac000 (size 448): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 c8 02 00 02 02 00 ff ff ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff0000c8d1e500 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.108s) > hex dump (first 32 bytes): > 00 40 e4 c9 00 00 ff ff c0 d9 9a 03 00 fc ff ff .@.............. > 80 d9 9a 03 00 fc ff ff 40 d9 9a 03 00 fc ff ff ........@....... 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000002f5588b4>] biovec_init_pool+0x24/0x2c > [<00000000a2b87494>] bioset_init+0x208/0x22c > [<000000004a1ea042>] bch2_fs_io_init+0x2c/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > unreferenced object 0xffff0000c8d1e900 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) > hex dump (first 32 bytes): > 00 c2 0a 02 02 00 ff ff 00 00 58 20 00 80 ff ff ..........X .... > 00 50 00 00 00 00 00 00 02 00 00 00 00 00 00 00 .P.............. > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff0002020ac200 (size 448): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 0d 69 37 bf .............i7. > 00 00 00 00 00 00 00 00 c8 c2 0a 02 02 00 ff ff ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff0000c8d1e980 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.116s) > hex dump (first 32 bytes): > 00 50 e4 c9 00 00 ff ff c0 dc 9a 03 00 fc ff ff .P.............. > 80 dc 9a 03 00 fc ff ff 40 44 23 03 00 fc ff ff ........@D#..... 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000002f5588b4>] biovec_init_pool+0x24/0x2c > [<00000000a2b87494>] bioset_init+0x208/0x22c > [<000000007af2eb34>] bch2_fs_io_init+0x48/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > unreferenced object 0xffff000230a31e00 (size 512): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) > hex dump (first 32 bytes): > 00 9f 39 08 00 fc ff ff 40 f9 99 08 00 fc ff ff ..9.....@....... > 40 c0 8b 08 00 fc ff ff 00 d8 14 08 00 fc ff ff @............... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000009f58f780>] bch2_fs_io_init+0x9c/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > unreferenced object 0xffff000200f2e800 (size 1024): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) > hex dump (first 32 bytes): > 40 00 00 00 00 00 00 00 89 16 1e cd 00 00 00 00 @............... > 10 e8 f2 00 02 00 ff ff 10 e8 f2 00 02 00 ff ff ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000066405974>] kvmalloc_node+0x54/0xe4 > [<00000000a51f16c9>] bucket_table_alloc.isra.0+0x44/0x120 > [<0000000000df2e94>] rhashtable_init+0x148/0x1ac > [<00000000347789c6>] bch2_fs_io_init+0xb8/0x124 > [<000000005ef642fb>] bch2_fs_alloc+0x820/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000222f14700 (size 256): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.120s) > hex dump (first 32 bytes): > 68 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 h............... > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<000000003a6af69a>] crypto_alloc_tfmmem+0x3c/0x70 > [<000000006c0841c0>] crypto_create_tfm_node+0x20/0xa0 > [<00000000b0aa6a0f>] crypto_alloc_tfm_node+0x94/0xac > [<00000000a2421d04>] crypto_alloc_shash+0x20/0x28 > [<00000000aeafee8e>] bch2_fs_encryption_init+0x64/0x150 > [<0000000002e060b3>] bch2_fs_alloc+0x840/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > unreferenced object 0xffff00020e544100 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) > hex dump (first 32 bytes): > 00 00 f0 20 02 00 ff ff 80 04 f0 20 02 00 ff ff ... ....... .... > 00 09 f0 20 02 00 ff ff 80 0d f0 20 02 00 ff ff ... ....... .... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000220f00000 (size 1104): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 98 e7 87 30 02 00 ff ff ...........0.... > 22 01 00 00 00 00 ad de 18 00 f0 20 02 00 ff ff ".......... .... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000220f00480 (size 1104): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.128s) > hex dump (first 32 bytes): > 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> 01 00 00 00 00 00 00 00 00 00 00 00 9d a2 98 2c ..............., > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000220f00900 (size 1104): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) > hex dump (first 32 bytes): > 22 01 00 00 00 00 ad de 08 09 f0 20 02 00 ff ff ".......... .... > 08 09 f0 20 02 00 ff ff b9 17 f0 20 02 00 ff ff ... ....... .... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff000220f00d80 (size 1104): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) > hex dump (first 32 bytes): > 01 00 00 00 00 00 00 00 00 00 00 00 34 43 3f b1 ............4C?. > 00 00 00 00 00 00 00 00 24 00 bb 04 02 28 1e 3b ........$....(.; > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000001719fe70>] bioset_init+0x188/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > unreferenced object 0xffff00020e544000 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.132s) > hex dump (first 32 bytes): > 00 00 b5 05 02 00 ff ff 00 50 b5 05 02 00 ff ff .........P...... > 00 10 b5 05 02 00 ff ff 00 b0 18 09 02 00 ff ff ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000007b360995>] __kmalloc_node+0xac/0xd4 > [<0000000050ae8904>] mempool_init_node+0x64/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000002f5588b4>] biovec_init_pool+0x24/0x2c > [<00000000a2b87494>] bioset_init+0x208/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > unreferenced object 0xffff00020918b000 (size 4096): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) > hex dump (first 32 bytes): > c0 5b 26 03 00 fc ff ff 00 10 00 00 00 00 00 00 .[&............. > 22 01 00 00 00 00 ad de 18 b0 18 09 02 00 ff ff "............... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<00000000af89e1a3>] mempool_alloc_slab+0x24/0x2c > [<000000002d6118f3>] mempool_init_node+0x94/0xd8 > [<00000000e714c59a>] mempool_init+0x14/0x1c > [<000000002f5588b4>] biovec_init_pool+0x24/0x2c > [<00000000a2b87494>] bioset_init+0x208/0x22c > [<00000000ad63d07f>] bch2_fs_fsio_init+0x8c/0x130 > [<00000000048cf3b9>] bch2_fs_alloc+0x870/0xbcc > [<00000000223e06bf>] bch2_fs_open+0x19c/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > unreferenced object 0xffff000200f2f400 (size 1024): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) > hex dump (first 32 bytes): > 19 00 00 00 00 00 00 00 9d 19 00 00 00 00 00 00 ................ > a6 19 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<0000000097d0e280>] bch2_blacklist_table_initialize+0x48/0xc4 > [<000000007af2f7c0>] bch2_fs_recovery+0x220/0x140c > [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac > [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > [<00000000f493e836>] __arm64_sys_mount+0x150/0x168 > [<00000000595788f9>] invoke_syscall.constprop.0+0x70/0xb8 > [<00000000e707b03d>] do_el0_svc+0xbc/0xf0 > [<00000000b4ee996a>] el0_svc+0x74/0x9c > unreferenced object 0xffff00020e544800 (size 128): > comm "mount", pid 3049, jiffies 4294924391 (age 3938.140s) > hex dump (first 32 bytes): > 07 00 00 01 09 00 00 00 73 74 61 72 74 69 6e 67 ........starting > 20 6a 6f 75 72 6e 61 6c 20 61 74 20 65 6e 74 72 journal at entr > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 > [<0000000078a13296>] krealloc+0x7c/0xc4 > [<00000000224f82f4>] __darray_make_room.constprop.0+0x5c/0x7c > [<00000000caa2f6f2>] __bch2_trans_log_msg+0x80/0x12c > [<0000000034a8dfea>] __bch2_fs_log_msg+0x68/0x158 > [<00000000cc0719ad>] bch2_journal_log_msg+0x60/0x98 > [<00000000a0b3d87b>] bch2_fs_recovery+0x8f0/0x140c > [<00000000835fe5c8>] bch2_fs_start+0x104/0x2ac > [<00000000f2c8e79f>] bch2_fs_open+0x2cc/0x430 > [<00000000e72d508e>] bch2_mount+0x194/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > unreferenced object 0xffff00020f2d8398 (size 184): > comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s) > hex dump (first 32 bytes): > 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0 > [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 > [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 > [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 > [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 > [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 > [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc > [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 > [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 > [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > [<00000000527b4561>] path_mount+0x5d0/0x6c8 > [<00000000dc643d96>] do_mount+0x80/0xa4 > unreferenced object 0xffff000222f15e00 (size 256): > comm "mount", pid 3049, jiffies 4294924395 (age 3938.128s) > hex dump (first 32 bytes): > 12 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 ff ff ff ff 00 10 00 00 00 00 00 00 ................ 
> backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338 > [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0 > [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 > [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 > [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 > [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 > [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 > [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc > [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 > [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 > [<00000000b7cffdf2>] bch2_mount+0x3bc/0x45c > [<00000000b040daa5>] legacy_get_tree+0x2c/0x54 > [<00000000ba80f9a0>] vfs_get_tree+0x28/0xd4 > unreferenced object 0xffff00020ab0da80 (size 128): > comm "fio", pid 3068, jiffies 4294924399 (age 3938.112s) > hex dump (first 32 bytes): > 70 61 74 68 3a 20 69 64 78 20 20 30 20 72 65 66 path: idx 0 ref > 20 30 3a 30 20 50 20 53 20 62 74 72 65 65 3d 73 0:0 P S btree=s > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 > [<0000000078a13296>] krealloc+0x7c/0xc4 > [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c > [<00000000e73dab89>] bch2_prt_printf+0xac/0x104 > [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8 > [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64 > [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18 > [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104 > [<00000000715f184d>] btree_path_alloc+0x44/0x140 > [<0000000028aac82e>] bch2_path_get+0x190/0x210 > [<000000001fbd1416>] bch2_trans_iter_init_outlined+0xd4/0x100 > [<00000000b7c2c8e8>] bch2_trans_iter_init.constprop.0+0x28/0x30 > [<000000005ee45b0d>] __bch2_dirent_lookup_trans+0xc4/0x20c > [<00000000bf9849b2>] bch2_dirent_lookup+0x9c/0x10c > unreferenced object 0xffff00020f2d8450 (size 184): > comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s) > hex dump (first 32 bytes): > 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000047719e9d>] kmem_cache_alloc+0xd0/0x17c > [<0000000059ea6346>] bch2_btree_path_traverse_cached_slowpath+0x240/0x9d0 > [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 > [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 > [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 > [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 > [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 > [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc > [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 > [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 > [<00000000225a6085>] bch2_lookup+0x7c/0xb8 > [<0000000059304a98>] __lookup_slow+0xd4/0x114 > [<000000001225c82d>] walk_component+0x98/0xd4 > [<0000000095114e46>] path_lookupat+0x84/0x114 > [<000000002ee74fa2>] filename_lookup+0x54/0xc4 > unreferenced object 0xffff000222f15a00 (size 256): > comm "fio", pid 3068, jiffies 4294924400 (age 3938.112s) > hex dump (first 32 bytes): > 13 81 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
> 00 00 00 00 ff ff ff ff f3 15 00 30 00 00 00 00 ...........0.... > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<00000000cd9515c0>] __kmalloc+0xac/0xd4 > [<0000000080dcf5d4>] btree_key_cache_fill+0x190/0x338 > [<00000000142a161b>] bch2_btree_path_traverse_cached_slowpath+0x8d8/0x9d0 > [<00000000b340fce9>] bch2_btree_path_traverse_cached+0x7c/0x184 > [<000000006b501914>] bch2_btree_path_traverse_one+0xbc/0x4f0 > [<0000000046611bb8>] bch2_btree_path_traverse+0x20/0x30 > [<00000000cb7378ca>] bch2_btree_iter_peek_slot+0xe4/0x3b0 > [<000000005b36d96f>] __bch2_bkey_get_iter.constprop.0+0x40/0x74 > [<00000000db6c00c7>] bch2_inode_peek+0x80/0xfc > [<00000000d48fafeb>] bch2_inode_find_by_inum_trans+0x34/0x74 > [<00000000d94a8ca3>] bch2_vfs_inode_get+0xdc/0x1a0 > [<00000000225a6085>] bch2_lookup+0x7c/0xb8 > [<0000000059304a98>] __lookup_slow+0xd4/0x114 > [<000000001225c82d>] walk_component+0x98/0xd4 > unreferenced object 0xffff0002058e0e00 (size 256): > comm "fio", pid 3081, jiffies 4294924461 (age 3937.868s) > hex dump (first 32 bytes): > 70 61 74 68 3a 20 69 64 78 20 20 31 20 72 65 66 path: idx 1 ref > 20 30 3a 30 20 50 20 20 20 62 74 72 65 65 3d 65 0:0 P btree=e > backtrace: > [<00000000bcb1cd8d>] slab_post_alloc_hook.isra.0+0xb4/0xbc > [<0000000027d98280>] __kmem_cache_alloc_node+0xd0/0x178 > [<000000005602d414>] __kmalloc_node_track_caller+0xa8/0xd0 > [<0000000078a13296>] krealloc+0x7c/0xc4 > [<00000000ac6de278>] bch2_printbuf_make_room+0x6c/0x9c > [<00000000e73dab89>] bch2_prt_printf+0xac/0x104 > [<00000000ef2c8dc5>] bch2_btree_path_to_text+0x6c/0xb8 > [<00000000eab3e43c>] __bch2_trans_paths_to_text+0x60/0x64 > [<00000000d843d03a>] bch2_trans_paths_to_text+0x10/0x18 > [<00000000fbe77c9c>] bch2_trans_update_max_paths+0x6c/0x104 > [<00000000af279ad9>] __bch2_btree_path_make_mut+0x64/0x1d0 > [<00000000b6ea382b>] __bch2_btree_path_set_pos+0x5c/0x1f4 > [<000000001f2292b9>] bch2_btree_path_set_pos+0x68/0x78 > [<0000000046402275>] bch2_btree_iter_peek_slot+0xd0/0x3b0 > [<00000000a578d851>] bchfs_read.isra.0+0x128/0x77c > [<00000000d544d588>] bch2_readahead+0x1a0/0x264 > > -- > Jens Axboe >
On 6/28/23 5:04 PM, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 04:14:44PM -0600, Jens Axboe wrote: >> Got a whole bunch more running that aio reproducer I sent earlier. I'm >> sure a lot of these are dupes, sending them here for completeness. > > Are you running 'echo scan > /sys/kernel/debug/kmemleak' while the test > is running? I see a lot of spurious leaks when I do that that go away if > I scan after everything's shut down. Nope, and they remain in there. The cat dump I took was an hour later.
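For reference, the kmemleak flow being discussed here is just the standard debugfs interface (assuming CONFIG_DEBUG_KMEMLEAK=y):

    # trigger an immediate scan (kmemleak also scans periodically on its own)
    echo scan > /sys/kernel/debug/kmemleak
    # dump the currently recorded unreferenced objects
    cat /sys/kernel/debug/kmemleak
    # forget everything recorded so far, so only new leaks show up later
    echo clear > /sys/kernel/debug/kmemleak

Scanning while the filesystem is still mounted and busy can report transient "leaks" that a later scan after everything has shut down no longer shows, which is the distinction being made in this exchange.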
On 6/28/23 4:55?PM, Kent Overstreet wrote: >> But it's not aio (or io_uring or whatever), it's simply the fact that >> doing an fput() from an exiting task (for example) will end up being >> done async. And hence waiting for task exits is NOT enough to ensure >> that all file references have been released. >> >> Since there are a variety of other reasons why a mount may be pinned and >> fail to umount, perhaps it's worth considering that changing this >> behavior won't buy us that much. Especially since it's been around for >> more than 10 years: > > Because it seems that before io_uring the race was quite a bit harder to > hit - I only started seeing it when things started switching over to > io_uring. generic/388 used to pass reliably for me (pre backpointers), > now it doesn't. I literally just pasted a script that hits it in one second with aio. So maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to hit with aio. As demonstrated. The io_uring is not hard to bring into parity on that front, here's one I posted earlier today for 6.5: https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ Doesn't change the fact that you can easily hit this with io_uring or aio, and probably more things too (didn't look any further). Is it a realistic thing outside of funky tests? Probably not really, or at least if those guys hit it they'd probably have the work-around hack in place in their script already. But the fact is that it's been around for a decade. It's somehow a lot easier to hit with bcachefs than XFS, which may just be because the former has a bunch of workers and this may be deferring the delayed fput work more. Just hand waving. >> then we'd probably want to move that deferred fput list to the >> task_struct and ensure that it gets run if the task exits rather than >> have a global deferred list. Currently we have: >> >> >> 1) If kthread or in interrupt >> 1a) add to global fput list >> 2) task_work_add if not. If that fails, goto 1a. >> >> which would then become: >> >> 1) If kthread or in interrupt >> 1a) add to global fput list >> 2) task_work_add if not. If that fails, we know task is existing. add to >> per-task defer list to be run at a convenient time before task has >> exited. > > no, it becomes: > if we're running in a user task, or if we're doing an operation on > behalf of a user task, add to the user task's deferred list: otherwise > add to global deferred list. And how would the "on behalf of a user task" work in terms of being in_interrupt()?
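For context, the two cases Jens outlines above correspond roughly to the current fput() logic. This is a simplified sketch of the upstream code path, not verbatim kernel source, and the struct file field names are approximate:

    void fput(struct file *file)
    {
            if (atomic_long_dec_and_test(&file->f_count)) {
                    struct task_struct *task = current;

                    if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
                            init_task_work(&file->f_rcuhead, ____fput);
                            if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
                                    return;
                            /* task_work_add() failed: the task is already exiting */
                    }

                    /* kthread, interrupt context, or exiting task: global deferred list */
                    if (llist_add(&file->f_llist, &delayed_fput_list))
                            schedule_delayed_work(&delayed_fput_work, 1);
            }
    }

The per-task idea discussed in the replies amounts to giving the "on behalf of" task a chance at the task_work path before anything falls back to the global delayed list.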
On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: > On 6/28/23 4:55?PM, Kent Overstreet wrote: > >> But it's not aio (or io_uring or whatever), it's simply the fact that > >> doing an fput() from an exiting task (for example) will end up being > >> done async. And hence waiting for task exits is NOT enough to ensure > >> that all file references have been released. > >> > >> Since there are a variety of other reasons why a mount may be pinned and > >> fail to umount, perhaps it's worth considering that changing this > >> behavior won't buy us that much. Especially since it's been around for > >> more than 10 years: > > > > Because it seems that before io_uring the race was quite a bit harder to > > hit - I only started seeing it when things started switching over to > > io_uring. generic/388 used to pass reliably for me (pre backpointers), > > now it doesn't. > > I literally just pasted a script that hits it in one second with aio. So > maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to > hit with aio. As demonstrated. The io_uring is not hard to bring into > parity on that front, here's one I posted earlier today for 6.5: > > https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ > > Doesn't change the fact that you can easily hit this with io_uring or > aio, and probably more things too (didn't look any further). Is it a > realistic thing outside of funky tests? Probably not really, or at least > if those guys hit it they'd probably have the work-around hack in place > in their script already. > > But the fact is that it's been around for a decade. It's somehow a lot > easier to hit with bcachefs than XFS, which may just be because the > former has a bunch of workers and this may be deferring the delayed fput > work more. Just hand waving. Not sure what you're arguing here...? We've had a long standing bug, it's recently become much easier to hit (for multiple reasons); we seem to be in agreement on all that. All I'm saying is that the existence of that bug previously is not reason to fix it now. > >> then we'd probably want to move that deferred fput list to the > >> task_struct and ensure that it gets run if the task exits rather than > >> have a global deferred list. Currently we have: > >> > >> > >> 1) If kthread or in interrupt > >> 1a) add to global fput list > >> 2) task_work_add if not. If that fails, goto 1a. > >> > >> which would then become: > >> > >> 1) If kthread or in interrupt > >> 1a) add to global fput list > >> 2) task_work_add if not. If that fails, we know task is existing. add to > >> per-task defer list to be run at a convenient time before task has > >> exited. > > > > no, it becomes: > > if we're running in a user task, or if we're doing an operation on > > behalf of a user task, add to the user task's deferred list: otherwise > > add to global deferred list. > > And how would the "on behalf of a user task" work in terms of being > in_interrupt()? I don't see any relation to in_interrupt? We'd have to add a version of fput() that takes an additional task_struct argument, and plumb that through the aio code - kioctx lifetime is tied to mm_struct, not task_struct, so we'd have to add a ref to the task_struct to kiocb. Which would probably be a good thing tbh, it'd let us e.g. account cpu time back to the original task when kiocb completion has to run out of a workqueue.
On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: > > On 6/28/23 4:55?PM, Kent Overstreet wrote: > > >> But it's not aio (or io_uring or whatever), it's simply the fact that > > >> doing an fput() from an exiting task (for example) will end up being > > >> done async. And hence waiting for task exits is NOT enough to ensure > > >> that all file references have been released. > > >> > > >> Since there are a variety of other reasons why a mount may be pinned and > > >> fail to umount, perhaps it's worth considering that changing this > > >> behavior won't buy us that much. Especially since it's been around for > > >> more than 10 years: > > > > > > Because it seems that before io_uring the race was quite a bit harder to > > > hit - I only started seeing it when things started switching over to > > > io_uring. generic/388 used to pass reliably for me (pre backpointers), > > > now it doesn't. > > > > I literally just pasted a script that hits it in one second with aio. So > > maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to > > hit with aio. As demonstrated. The io_uring is not hard to bring into > > parity on that front, here's one I posted earlier today for 6.5: > > > > https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ > > > > Doesn't change the fact that you can easily hit this with io_uring or > > aio, and probably more things too (didn't look any further). Is it a > > realistic thing outside of funky tests? Probably not really, or at least > > if those guys hit it they'd probably have the work-around hack in place > > in their script already. > > > > But the fact is that it's been around for a decade. It's somehow a lot > > easier to hit with bcachefs than XFS, which may just be because the > > former has a bunch of workers and this may be deferring the delayed fput > > work more. Just hand waving. > > Not sure what you're arguing here...? > > We've had a long standing bug, it's recently become much easier to hit > (for multiple reasons); we seem to be in agreement on all that. All I'm > saying is that the existence of that bug previously is not reason to fix > it now. I agree with Kent here - the kernel bug needs to be fixed regardless of how long it has been around. Blaming the messenger (userspace, fstests, etc) and saying it should work around a spurious, unpredictable, undesirable and user-undebuggable kernel behaviour is not an acceptible solution here... I don't care how the kernel bug gets fixed, I just want the spurious unmount failures when there are no userspace processes actively using the filesytsem to go away forever. -Dave.
On 6/28/23 5:50?PM, Kent Overstreet wrote: > On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: >> On 6/28/23 4:55?PM, Kent Overstreet wrote: >>>> But it's not aio (or io_uring or whatever), it's simply the fact that >>>> doing an fput() from an exiting task (for example) will end up being >>>> done async. And hence waiting for task exits is NOT enough to ensure >>>> that all file references have been released. >>>> >>>> Since there are a variety of other reasons why a mount may be pinned and >>>> fail to umount, perhaps it's worth considering that changing this >>>> behavior won't buy us that much. Especially since it's been around for >>>> more than 10 years: >>> >>> Because it seems that before io_uring the race was quite a bit harder to >>> hit - I only started seeing it when things started switching over to >>> io_uring. generic/388 used to pass reliably for me (pre backpointers), >>> now it doesn't. >> >> I literally just pasted a script that hits it in one second with aio. So >> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to >> hit with aio. As demonstrated. The io_uring is not hard to bring into >> parity on that front, here's one I posted earlier today for 6.5: >> >> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ >> >> Doesn't change the fact that you can easily hit this with io_uring or >> aio, and probably more things too (didn't look any further). Is it a >> realistic thing outside of funky tests? Probably not really, or at least >> if those guys hit it they'd probably have the work-around hack in place >> in their script already. >> >> But the fact is that it's been around for a decade. It's somehow a lot >> easier to hit with bcachefs than XFS, which may just be because the >> former has a bunch of workers and this may be deferring the delayed fput >> work more. Just hand waving. > > Not sure what you're arguing here...? > > We've had a long standing bug, it's recently become much easier to hit > (for multiple reasons); we seem to be in agreement on all that. All I'm > saying is that the existence of that bug previously is not reason to fix > it now. Not really arguing, just stating that it's not a huge problem as it's not something that real world would tend to do and probably why we saw it in a test case instead. >>>> then we'd probably want to move that deferred fput list to the >>>> task_struct and ensure that it gets run if the task exits rather than >>>> have a global deferred list. Currently we have: >>>> >>>> >>>> 1) If kthread or in interrupt >>>> 1a) add to global fput list >>>> 2) task_work_add if not. If that fails, goto 1a. >>>> >>>> which would then become: >>>> >>>> 1) If kthread or in interrupt >>>> 1a) add to global fput list >>>> 2) task_work_add if not. If that fails, we know task is existing. add to >>>> per-task defer list to be run at a convenient time before task has >>>> exited. >>> >>> no, it becomes: >>> if we're running in a user task, or if we're doing an operation on >>> behalf of a user task, add to the user task's deferred list: otherwise >>> add to global deferred list. >> >> And how would the "on behalf of a user task" work in terms of being >> in_interrupt()? > > I don't see any relation to in_interrupt? Just saying that you'd now need the task passed in. 
> We'd have to add a version of fput() that takes an additional > task_struct argument, and plumb that through the aio code - kioctx > lifetime is tied to mm_struct, not task_struct, so we'd have to add a > ref to the task_struct to kiocb. > > Which would probably be a good thing tbh, it'd let us e.g. account cpu > time back to the original task when kiocb completion has to run out of a > workqueue.

Might also introduce some funky dependencies. Probably not an issue if tied to the aio_kiocb. If you go ahead with that, just make sure you keep the task referencing out of the fput variant for users that don't need that.
On 6/28/23 7:00?PM, Dave Chinner wrote: > On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote: >> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: >>> On 6/28/23 4:55?PM, Kent Overstreet wrote: >>>>> But it's not aio (or io_uring or whatever), it's simply the fact that >>>>> doing an fput() from an exiting task (for example) will end up being >>>>> done async. And hence waiting for task exits is NOT enough to ensure >>>>> that all file references have been released. >>>>> >>>>> Since there are a variety of other reasons why a mount may be pinned and >>>>> fail to umount, perhaps it's worth considering that changing this >>>>> behavior won't buy us that much. Especially since it's been around for >>>>> more than 10 years: >>>> >>>> Because it seems that before io_uring the race was quite a bit harder to >>>> hit - I only started seeing it when things started switching over to >>>> io_uring. generic/388 used to pass reliably for me (pre backpointers), >>>> now it doesn't. >>> >>> I literally just pasted a script that hits it in one second with aio. So >>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to >>> hit with aio. As demonstrated. The io_uring is not hard to bring into >>> parity on that front, here's one I posted earlier today for 6.5: >>> >>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ >>> >>> Doesn't change the fact that you can easily hit this with io_uring or >>> aio, and probably more things too (didn't look any further). Is it a >>> realistic thing outside of funky tests? Probably not really, or at least >>> if those guys hit it they'd probably have the work-around hack in place >>> in their script already. >>> >>> But the fact is that it's been around for a decade. It's somehow a lot >>> easier to hit with bcachefs than XFS, which may just be because the >>> former has a bunch of workers and this may be deferring the delayed fput >>> work more. Just hand waving. >> >> Not sure what you're arguing here...? >> >> We've had a long standing bug, it's recently become much easier to hit >> (for multiple reasons); we seem to be in agreement on all that. All I'm >> saying is that the existence of that bug previously is not reason to fix >> it now. > > I agree with Kent here - the kernel bug needs to be fixed > regardless of how long it has been around. Blaming the messenger > (userspace, fstests, etc) and saying it should work around a > spurious, unpredictable, undesirable and user-undebuggable kernel > behaviour is not an acceptible solution here... Not sure why you both are putting words in my mouth, I've merely been arguing pros and cons and the impact of this. I even linked the io_uring addition for ensuring that side will work better once the deferred fput is sorted out. I didn't like the idea of fixing this through umount, and even outlined how it could be fixed properly by ensuring we flush per-task deferred puts on task exit. Do I think it's a big issue? Not at all, because a) nobody has reported it until now, and b) it's kind of a stupid case. If we can fix it with minimal impact, should we? Yep. Particularly as the assumptions stated in the original commit I referenced were not even valid back then.
On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote: > On 6/28/23 7:00?PM, Dave Chinner wrote: > > On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote: > >> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: > >>> On 6/28/23 4:55?PM, Kent Overstreet wrote: > >>>>> But it's not aio (or io_uring or whatever), it's simply the fact that > >>>>> doing an fput() from an exiting task (for example) will end up being > >>>>> done async. And hence waiting for task exits is NOT enough to ensure > >>>>> that all file references have been released. > >>>>> > >>>>> Since there are a variety of other reasons why a mount may be pinned and > >>>>> fail to umount, perhaps it's worth considering that changing this > >>>>> behavior won't buy us that much. Especially since it's been around for > >>>>> more than 10 years: > >>>> > >>>> Because it seems that before io_uring the race was quite a bit harder to > >>>> hit - I only started seeing it when things started switching over to > >>>> io_uring. generic/388 used to pass reliably for me (pre backpointers), > >>>> now it doesn't. > >>> > >>> I literally just pasted a script that hits it in one second with aio. So > >>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to > >>> hit with aio. As demonstrated. The io_uring is not hard to bring into > >>> parity on that front, here's one I posted earlier today for 6.5: > >>> > >>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ > >>> > >>> Doesn't change the fact that you can easily hit this with io_uring or > >>> aio, and probably more things too (didn't look any further). Is it a > >>> realistic thing outside of funky tests? Probably not really, or at least > >>> if those guys hit it they'd probably have the work-around hack in place > >>> in their script already. > >>> > >>> But the fact is that it's been around for a decade. It's somehow a lot > >>> easier to hit with bcachefs than XFS, which may just be because the > >>> former has a bunch of workers and this may be deferring the delayed fput > >>> work more. Just hand waving. > >> > >> Not sure what you're arguing here...? > >> > >> We've had a long standing bug, it's recently become much easier to hit > >> (for multiple reasons); we seem to be in agreement on all that. All I'm > >> saying is that the existence of that bug previously is not reason to fix > >> it now. > > > > I agree with Kent here - the kernel bug needs to be fixed > > regardless of how long it has been around. Blaming the messenger > > (userspace, fstests, etc) and saying it should work around a > > spurious, unpredictable, undesirable and user-undebuggable kernel > > behaviour is not an acceptible solution here... > > Not sure why you both are putting words in my mouth, I've merely been > arguing pros and cons and the impact of this. I even linked the io_uring > addition for ensuring that side will work better once the deferred fput > is sorted out. I didn't like the idea of fixing this through umount, and > even outlined how it could be fixed properly by ensuring we flush > per-task deferred puts on task exit. > > Do I think it's a big issue? Not at all, because a) nobody has reported > it until now, and b) it's kind of a stupid case. If we can fix it with Agreed. > minimal impact, should we? Yep. Particularly as the assumptions stated > in the original commit I referenced were not even valid back then. There seems to be a wild misconception here which frankly is very concering. 
Afaik, it is absolutely not the case that an fput() from an exiting task ends up in delayed_work(). But I'm happy to be convinced otherwise. But thinking about it for more than a second, it would mean that __every single task__ that passes through do_exit() would punt all its files to delayed_fput() for closing. __Every single task on the system__. What sort of DOS vector do people think we built into the kernel's exit path? Hundreds or thousands of systemd services can have thousands of fds open and somehow we punt them all to delayed_fput() when they get killed, shutdown, or exit? do_exit() -> io_uring_files_cancel() /* can register task work */ -> exit_files() /* can register task work */ -> exit_task_work() -> task_work_run() /* run queued task work and when done set &work_exited sentinel */ Only after exit_task_work() is called do we need to rely on delayed_fput(). But all files in the tasks fd table will have already been registered for cleaned up in exit_files() via task work if that task does indeed hold the last reference. Unless, we're in an interrupt context or we're dealing with a PF_KTHREAD... So, generic/388 calls fsstress. If aio and/or io_uring is present fsstress will be linked against aio/io_uring and will execute these codepaths in its default invocation. Compile out the aio and io_uring support, register a kretprobe via bpftrace and snoop on how many times delayed_fput() is called when the test is run with however many threads you want: absolutely not a single time. Every SIGKILLed task goes through do_exit(), exit_files(), registers their fputs as task work and then calls exit_task_work(), runs task work, then disables task work and finally dies peacefully. Now compile in aio and io_uring support, register a kretprobe via bpftrace and snoop on how many times delayed_fput() is called and see frequent delayed_fput() calls for aio and less frequent delayed_fput() calls for io_uring. Taking aio as an example, if we SIGKILL the last userspace process with the aio fd it will exit and at some point the kernel will hit exit_aio() on last __mmput(): do_exit() -> exit_mm() -> mmput() /* mm could be pinned by some crap somewhere */ -> exit_aio() -> io_uring_files_cancel() /* can use task work */ -> exit_files() /* can use task work */ -> exit_task_work() /* no more task work after that */ -> task_work_run() /* run queued task work and only when done set &work_exited sentinel */ If there are any outstanding io requests that haven't been completed then aio will cancel them and punt them onto the system work queue. Which is, surprise, a PF_KTHREAD. So then delayed_fput() is hit. io_uring hits this less frequently but it does punt work to a kthread via io_fallback_tw() iirc if the current task is PF_EXITING and so uses delayed_fput() in these scenarios. So this is async io punting to a kthread for cleanup _explicitly_ that's causing issues here and it is an async io problem. Why is this racy? For this to become a problem delayed_fput() work must've been registered plus the delayed_fput() being called from the system workqueue must take longer than the task work by all of the other exiting threads. We give zero f***s about legacy aio - Heidi meme-style. Not a single line of code in the VFS will be complicated because of this legacy cruft that everyone hates with a passion. Ultimately it's even why we ended up with the nice io_uring io_worker model. 
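A minimal way to do the snooping described above, assuming a bpftrace-capable kernel and that delayed_fput() is visible to kprobes on your build, is something like:

    # count delayed_fput() invocations for the duration of a test run;
    # Ctrl-C prints the counter
    bpftrace -e 'kprobe:delayed_fput { @calls = count(); }'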
And io_uring - Jens can correct me - can probably be improved to rely on task work even if the task is PF_EXITING, as long as exit_task_work() hasn't been called for that task, which I reckon it hasn't. So probably how io_uring cancels work in io_uring_files_cancel() needs to be tweaked if that really is an issue.

But hard NAK on fiddling with umount for any of this. Umount has never and will never give any guarantee that a superblock is gone when it returns. Even if it succeeds and returns, it doesn't mean that the superblock has gone away and that the filesystem can be mounted again fresh. Bind mounts, mount namespaces, and mount propagation - independent of someone pinning files in a given mount or kthread-based async io fput cleanup - make this completely meaningless. We can't guarantee that and we will absolutely not get in the business of that in the umount code.

If this generic/388 test is to be reliable rn in the face of async io, the synchronization method via fsnotify that allows you to get notified about superblock destruction should be used.
On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote: > There seems to be a wild misconception here which frankly is very > concering. Afaik, it is absolutely not the case that an fput() from an > exiting task ends up in delayed_work(). But I'm happy to be convinced > otherwise.

I already explained the real issue - it's fput() from an AIO completion, because that has no association with the task it was done on behalf of.
On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote: > On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote: > > On 6/28/23 7:00?PM, Dave Chinner wrote: > > > On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote: > > >> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: > > >>> On 6/28/23 4:55?PM, Kent Overstreet wrote: > > >>>>> But it's not aio (or io_uring or whatever), it's simply the fact that > > >>>>> doing an fput() from an exiting task (for example) will end up being > > >>>>> done async. And hence waiting for task exits is NOT enough to ensure > > >>>>> that all file references have been released. > > >>>>> > > >>>>> Since there are a variety of other reasons why a mount may be pinned and > > >>>>> fail to umount, perhaps it's worth considering that changing this > > >>>>> behavior won't buy us that much. Especially since it's been around for > > >>>>> more than 10 years: > > >>>> > > >>>> Because it seems that before io_uring the race was quite a bit harder to > > >>>> hit - I only started seeing it when things started switching over to > > >>>> io_uring. generic/388 used to pass reliably for me (pre backpointers), > > >>>> now it doesn't. > > >>> > > >>> I literally just pasted a script that hits it in one second with aio. So > > >>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to > > >>> hit with aio. As demonstrated. The io_uring is not hard to bring into > > >>> parity on that front, here's one I posted earlier today for 6.5: > > >>> > > >>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ > > >>> > > >>> Doesn't change the fact that you can easily hit this with io_uring or > > >>> aio, and probably more things too (didn't look any further). Is it a > > >>> realistic thing outside of funky tests? Probably not really, or at least > > >>> if those guys hit it they'd probably have the work-around hack in place > > >>> in their script already. > > >>> > > >>> But the fact is that it's been around for a decade. It's somehow a lot > > >>> easier to hit with bcachefs than XFS, which may just be because the > > >>> former has a bunch of workers and this may be deferring the delayed fput > > >>> work more. Just hand waving. > > >> > > >> Not sure what you're arguing here...? > > >> > > >> We've had a long standing bug, it's recently become much easier to hit > > >> (for multiple reasons); we seem to be in agreement on all that. All I'm > > >> saying is that the existence of that bug previously is not reason to fix > > >> it now. > > > > > > I agree with Kent here - the kernel bug needs to be fixed > > > regardless of how long it has been around. Blaming the messenger > > > (userspace, fstests, etc) and saying it should work around a > > > spurious, unpredictable, undesirable and user-undebuggable kernel > > > behaviour is not an acceptible solution here... > > > > Not sure why you both are putting words in my mouth, I've merely been > > arguing pros and cons and the impact of this. I even linked the io_uring > > addition for ensuring that side will work better once the deferred fput > > is sorted out. I didn't like the idea of fixing this through umount, and > > even outlined how it could be fixed properly by ensuring we flush > > per-task deferred puts on task exit. > > > > Do I think it's a big issue? Not at all, because a) nobody has reported > > it until now, and b) it's kind of a stupid case. If we can fix it with > > Agreed. 
yeah, the rest of this email that I snipped is _severely_ confused about what is going on here. Look, the main thing I want to say is - I'm not at all impressed by this continual evasiveness from you and Jens. It's a bug, it needs to be fixed. We are engineers. It is our literal job to do the hard work and solve the hard problems, and leave behind a system more robust and more reliable for the people who come after us to use. Not to kick the can down the line and leave lurking landmines in the form of "oh you just have to work around this like x..."
On Thu, Jun 29, 2023 at 11:31:09AM -0400, Kent Overstreet wrote: > On Thu, Jun 29, 2023 at 01:18:11PM +0200, Christian Brauner wrote: > > On Wed, Jun 28, 2023 at 07:33:18PM -0600, Jens Axboe wrote: > > > On 6/28/23 7:00?PM, Dave Chinner wrote: > > > > On Wed, Jun 28, 2023 at 07:50:18PM -0400, Kent Overstreet wrote: > > > >> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote: > > > >>> On 6/28/23 4:55?PM, Kent Overstreet wrote: > > > >>>>> But it's not aio (or io_uring or whatever), it's simply the fact that > > > >>>>> doing an fput() from an exiting task (for example) will end up being > > > >>>>> done async. And hence waiting for task exits is NOT enough to ensure > > > >>>>> that all file references have been released. > > > >>>>> > > > >>>>> Since there are a variety of other reasons why a mount may be pinned and > > > >>>>> fail to umount, perhaps it's worth considering that changing this > > > >>>>> behavior won't buy us that much. Especially since it's been around for > > > >>>>> more than 10 years: > > > >>>> > > > >>>> Because it seems that before io_uring the race was quite a bit harder to > > > >>>> hit - I only started seeing it when things started switching over to > > > >>>> io_uring. generic/388 used to pass reliably for me (pre backpointers), > > > >>>> now it doesn't. > > > >>> > > > >>> I literally just pasted a script that hits it in one second with aio. So > > > >>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to > > > >>> hit with aio. As demonstrated. The io_uring is not hard to bring into > > > >>> parity on that front, here's one I posted earlier today for 6.5: > > > >>> > > > >>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@kernel.dk/ > > > >>> > > > >>> Doesn't change the fact that you can easily hit this with io_uring or > > > >>> aio, and probably more things too (didn't look any further). Is it a > > > >>> realistic thing outside of funky tests? Probably not really, or at least > > > >>> if those guys hit it they'd probably have the work-around hack in place > > > >>> in their script already. > > > >>> > > > >>> But the fact is that it's been around for a decade. It's somehow a lot > > > >>> easier to hit with bcachefs than XFS, which may just be because the > > > >>> former has a bunch of workers and this may be deferring the delayed fput > > > >>> work more. Just hand waving. > > > >> > > > >> Not sure what you're arguing here...? > > > >> > > > >> We've had a long standing bug, it's recently become much easier to hit > > > >> (for multiple reasons); we seem to be in agreement on all that. All I'm > > > >> saying is that the existence of that bug previously is not reason to fix > > > >> it now. > > > > > > > > I agree with Kent here - the kernel bug needs to be fixed > > > > regardless of how long it has been around. Blaming the messenger > > > > (userspace, fstests, etc) and saying it should work around a > > > > spurious, unpredictable, undesirable and user-undebuggable kernel > > > > behaviour is not an acceptible solution here... > > > > > > Not sure why you both are putting words in my mouth, I've merely been > > > arguing pros and cons and the impact of this. I even linked the io_uring > > > addition for ensuring that side will work better once the deferred fput > > > is sorted out. I didn't like the idea of fixing this through umount, and > > > even outlined how it could be fixed properly by ensuring we flush > > > per-task deferred puts on task exit. > > > > > > Do I think it's a big issue? 
Not at all, because a) nobody has reported > > > it until now, and b) it's kind of a stupid case. If we can fix it with > > > > Agreed. > > yeah, the rest of this email that I snipped is _severely_ confused about > what is going on here. > > Look, the main thing I want to say is - I'm not at all impressed by this > continual evasiveness from you and Jens. It's a bug, it needs to be > fixed. > > We are engineers. It is our literal job to do the hard work and solve > the hard problems, and leave behind a system more robust and more > reliable for the people who come after us to use. > > Not to kick the can down the line and leave lurking landmines in the > form of "oh you just have to work around this like x..."

We're all not very impressed with what's going on here. I think everyone has made that pretty clear.

It's worrying that this reply is so quickly and happily turning to "I'm a real engineer" and "you're confused" tropes and then isn't even making a clear point. Going forward this should stop, otherwise I'll cease replying.

Nothing I said was confused. The discussion was initially trying to fix this in umount and we're not going to fix async aio behavior in umount. My earlier mail clearly said that io_uring can be changed by Jens pretty quickly to not cause such test failures. But there's a trade-off to be considered where we have to introduce new sensitive and complicated file cleanup code for the sake of the legacy aio api that even the manpage marks as incomplete and buggy. And all for an issue that was only ever found out in a test and for behavior that's existed since the dawn of time. "We're real engineers" is not an argument for that trade off being sensible.
On Fri, Jun 30, 2023 at 11:40:32AM +0200, Christian Brauner wrote: > We're all not very impressed with what's going on here. I think everyone > has made that pretty clear. > > It's worrying that this reply is so quickly and happily turning to > "I'm a real engineer" and "you're confused" tropes and then isn't even > making a clear point. Going forward this should stop, otherwise I'll > cease replying. > > Nothing I said was confused. The discussion was initially trying to fix > this in umount and we're not going to fix async aio behavior in umount.

Christian, why on earth would we be trying to fix this in umount? All you posted was a stack trace and something handwavy about how fixing it in umount would be hard, and yes it would be! That's crazy!

This is a basic lifetime issue, where we just need to make sure that refcounts are getting released at the appropriate place and not being delayed for arbitrarily long (i.e. the global delayed fput list, which honestly we should probably try to get rid of). Furthermore, when issues with fput have caused umount to fail in the past, it's always been considered a bug - see the addition of __fput_sync(), if you do some searching you should be able to find multiple patches where this has been dealt with.

> My earlier mail clearly said that io_uring can be changed by Jens pretty > quickly to not cause such test failures.

Jens posted a fix that didn't actually fix anything, and after that it seemed neither of you were interested in actually fixing this. So based on that, maybe we need to consider switching fstests back to AIO just so we can get work done...
On Mon, Jun 26, 2023 at 05:47:01PM -0400, Kent Overstreet wrote: > Hi Linus, > > Here it is, the bcachefs pull request. For brevity the list of patches > below is only the initial part of the series, the non-bcachefs prep > patches and the first bcachefs patch, but the diffstat is for the entire > series. > > Six locks has all the changes you suggested, text size went down > significantly. If you'd still like this to see more review from the > locking people, I'm not against them living in fs/bcachefs/ as an > interim; perhaps Dave could move them back to kernel/locking when he > starts using them or when locking people have had time to look at them - > I'm just hoping for this to not block the merge. > > Recently some people have expressed concerns about "not wanting a repeat > of ntfs3" - from what I understand the issue there was just severe > buggyness, so perhaps showing the bcachefs automated test results will > help with that: > > https://evilpiepirate.org/~testdashboard/ci > > The main bcachefs branch runs fstests and my own test suite in several > varations, including lockdep+kasan, preempt, and gcov (we're at 82% line > coverage); I'm not currently seeing any lockdep or kasan splats (or > panics/oopses, for that matter). > > (Worth noting the bug causing the most test failures by a wide margin is > actually an io_uring bug that causes random umount failures in shutdown > tests. Would be great to get that looked at, it doesn't just affect > bcachefs). > > Regarding feature status - most features are considered stable and ready > for use, snapshots and erasure coding are both nearly there. But a > filesystem on this scale is a massive project, adequately conveying the > status of every feature would take at least a page or two. > > We may want to mark it as EXPERIMENTAL for a few releases, I haven't > done that as yet. (I wouldn't consider single device without snapshots > to be experimental, but - given that the number of users and bug reports > is about to shoot up, perhaps I should...). Restarting the discussion after the holiday weekend, hoping to get something more substantive going: Hoping to get: - Thoughts from people who have been following bcachefs development, and people who have looked at the code - Continuation of the LSF discussion - maybe some people could repeat here what they said there (re: code review, iomap, etc.) - Any concerns about how this might impact the rest of the kernel, or discussion about what impact merging a new filesystem is likely to have on other people's work AFAIK the only big ask that hasn't happened yet is better documentation: David Howells wanted (better) a man page, which is definitely something that needs to happen but it'll be some months before I'm back to working on documentation - I'm happy to share my current list of priorities if that would be helpful. In the meantime, the Latex principles of operation is reasonably up to date (and I intend to greatly expand the sections on on disk data structures, I think that'll be great reference documentation for developers getting up to speed on the code) https://bcachefs.org/bcachefs-principles-of-operation.pdf I feel that bcachefs is in a pretty mature state at this point, but it's also _huge_, which is a bit different than e.g. the btrfs merger; it's hard to know where to start to get a meaninful discussion/review process going. 
Patch bombing the mailing list with 90k loc is clearly not going to be productive, which is why I've been trying to talk more about development process and status - but all suggestions and feedback are welcome. Cheers, Kent
On 7/6/23 9:20?AM, Kent Overstreet wrote: >> My earlier mail clearly said that io_uring can be changed by Jens pretty >> quickly to not cause such test failures. > > Jens posted a fix that didn't actually fix anything, and after that it > seemed neither of you were interested in actually fixing this. So > based on that, maybe we need to consider switching fstests back to AIO > just so we can get work done... Yeah let's keep misrepresenting... I already showed how to hit this easily with aio, and you said you'd fix aio. But nothing really happened there, unsurprisingly. You do what you want, as per usual these threads just turn into an unproductive (and waste of time) shit show. Muted on my end from now on.
On Thu, Jul 06, 2023 at 10:26:34AM -0600, Jens Axboe wrote: > On 7/6/23 9:20?AM, Kent Overstreet wrote: > >> My earlier mail clearly said that io_uring can be changed by Jens pretty > >> quickly to not cause such test failures. > > > > Jens posted a fix that didn't actually fix anything, and after that it > > seemed neither of you were interested in actually fixing this. So > > based on that, maybe we need to consider switching fstests back to AIO > > just so we can get work done... > > Yeah let's keep misrepresenting... I already showed how to hit this > easily with aio, and you said you'd fix aio. But nothing really happened > there, unsurprisingly. Jens, your test case showing that this happens on aio too was appreciated: I was out of town for the holiday weekend, and I'm just now back home catching up and fixing your test case is the first thing I'm working on. But like I said, this wasn't causing test failures when we were using AIO, it's only since we switched to io_uring that this has become an issue, and I'm not the only one telling you this is an issue, so ball very much in your court. > You do what you want, as per usual these threads just turn into an > unproductive (and waste of time) shit show. Muted on my end from now on. Ok.
On Thu, Jul 06, 2023 at 11:56:02AM -0400, Kent Overstreet wrote: > On Mon, Jun 26, 2023 at 05:47:01PM -0400, Kent Overstreet wrote: > > Hi Linus, > > > > Here it is, the bcachefs pull request. For brevity the list of patches > > below is only the initial part of the series, the non-bcachefs prep > > patches and the first bcachefs patch, but the diffstat is for the entire > > series. > > > > Six locks has all the changes you suggested, text size went down > > significantly. If you'd still like this to see more review from the > > locking people, I'm not against them living in fs/bcachefs/ as an > > interim; perhaps Dave could move them back to kernel/locking when he > > starts using them or when locking people have had time to look at them - > > I'm just hoping for this to not block the merge. > > > > Recently some people have expressed concerns about "not wanting a repeat > > of ntfs3" - from what I understand the issue there was just severe > > buggyness, so perhaps showing the bcachefs automated test results will > > help with that: > > > > https://evilpiepirate.org/~testdashboard/ci > > > > The main bcachefs branch runs fstests and my own test suite in several > > varations, including lockdep+kasan, preempt, and gcov (we're at 82% line > > coverage); I'm not currently seeing any lockdep or kasan splats (or > > panics/oopses, for that matter). > > > > (Worth noting the bug causing the most test failures by a wide margin is > > actually an io_uring bug that causes random umount failures in shutdown > > tests. Would be great to get that looked at, it doesn't just affect > > bcachefs). > > > > Regarding feature status - most features are considered stable and ready > > for use, snapshots and erasure coding are both nearly there. But a > > filesystem on this scale is a massive project, adequately conveying the > > status of every feature would take at least a page or two. > > > > We may want to mark it as EXPERIMENTAL for a few releases, I haven't > > done that as yet. (I wouldn't consider single device without snapshots > > to be experimental, but - given that the number of users and bug reports > > is about to shoot up, perhaps I should...). > > Restarting the discussion after the holiday weekend, hoping to get > something more substantive going: > > Hoping to get: > - Thoughts from people who have been following bcachefs development, > and people who have looked at the code > - Continuation of the LSF discussion - maybe some people could repeat > here what they said there (re: code review, iomap, etc.) > - Any concerns about how this might impact the rest of the kernel, or > discussion about what impact merging a new filesystem is likely to > have on other people's work > > AFAIK the only big ask that hasn't happened yet is better documentation: > David Howells wanted (better) a man page, which is definitely something > that needs to happen but it'll be some months before I'm back to working > on documentation - I'm happy to share my current list of priorities if > that would be helpful. > > In the meantime, the Latex principles of operation is reasonably up to > date (and I intend to greatly expand the sections on on disk data > structures, I think that'll be great reference documentation for > developers getting up to speed on the code) > > https://bcachefs.org/bcachefs-principles-of-operation.pdf > > I feel that bcachefs is in a pretty mature state at this point, but it's > also _huge_, which is a bit different than e.g. 
the btrfs merger; it's > hard to know where to start to get a meaninful discussion/review process > going. > > Patch bombing the mailing list with 90k loc is clearly not going to be > productive, which is why I've been trying to talk more about development > process and status - but all suggestions and feedback are welcome. I've been watching this from the sidelines sort of busy with other things, but I realize that comments I made at LSFMMBPF have been sort of taken as the gospel truth and I want to clear some of that up. I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it here. I'm of the opinion that me and any other outsider reviewing the bcachefs code in bulk is largely useless. I could probably do things like check for locking stuff and other generic things. You have patches that are outside of fs/bcachefs. Get those merged and then do a pull with just fs/bcachefs, because again posting 90k loc is going to be unwieldy and the quality of review just simply will not make a difference. Alternatively rework your code to not have any dependencies outside of fs/bcachefs. This is what btrfs did. That merge didn't touch anything outside of fs/btrfs. This merge attempt has gone off the rails, for what appears to be a few common things. 1) The external dependencies. There's a reason I was really specific about what I said at LSFMMBPF, both this year and in 2022. Get these patches merged first, the rest will be easier. You are burning a lot of good will being combative with people over these dependencies. This is not the hill to die on. You want bcachefs in the kernel and to get back to bcachefs things. Make the changes you need to make to get these dependencies in, or simply drop the need for them and come back to it later after bcachefs is merged. 2) We already have recent examples of merge and disappear. Yes of course you've been around for a long time, you aren't the NTFS developers. But as you point out it's 90k of code. When btrfs was merged there were 3 large contributors, Chris, myself, and Yanzheng. If Chris got hit by a bus we could still drive the project forward. Can the same be said for bachefs? I know others have chimed in and done some stuff, but as it's been stated elsewhere it would be good to have somebody else in the MAINTAINERS file with you. I am really, really wanting you to succeed here Kent. If the general consensus is you need to have some idiot review fs/bcachefs I will happily carve out some time and dig in. At this point however it's time to be pragmatic. Stop dying on every hill, it's not worth it. Ruthlessly prioritize and do what needs to be done to get this thing merged. Christian saying he's almost ready to stop replying should be a wakeup call that your approach is not working. Thanks, Josef
On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote: > I've been watching this from the sidelines sort of busy with other things, but I > realize that comments I made at LSFMMBPF have been sort of taken as the gospel > truth and I want to clear some of that up. > > I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it > here. > > I'm of the opinion that me and any other outsider reviewing the bcachefs code in > bulk is largely useless. I could probably do things like check for locking > stuff and other generic things. Yeah, agreed. And the generic things - that's what we've got automated testing for; there's a reason I've been putting so much effort into automated testing over (especially) the past year. > You have patches that are outside of fs/bcachefs. Get those merged and then do > a pull with just fs/bcachefs, because again posting 90k loc is going to be > unwieldy and the quality of review just simply will not make a difference. > > Alternatively rework your code to not have any dependencies outside of > fs/bcachefs. This is what btrfs did. That merge didn't touch anything outside > of fs/btrfs. We've had other people saying, at multiple times in the past, that patches that are only needed for bcachefs should be part of the initial pull instead of going in separately. I've already cut down the non-bcachefs pull quite a bit, even to the point of making non-ideal engineering choices, and if I have to cut it down more it's going to mean more ugly choices. > This merge attempt has gone off the rails, for what appears to be a few common > things. > > 1) The external dependencies. There's a reason I was really specific about what > I said at LSFMMBPF, both this year and in 2022. Get these patches merged first, > the rest will be easier. You are burning a lot of good will being combative > with people over these dependencies. This is not the hill to die on. You want > bcachefs in the kernel and to get back to bcachefs things. Make the changes you > need to make to get these dependencies in, or simply drop the need for them and > come back to it later after bcachefs is merged. Look, I'm not at all trying to be combative, I'm just trying to push things forward. The one trainwreck-y thread was regarding vmalloc_exec(), and posting that patch needed to happen in order to figure out what was even going to happen regarding the dynamic codegen going forward. It's been dropped from the initial pull, and dynamic codegen is going to wait on a better executable memory allocation API. (and yes, that thread _was_ a trainwreck; it's not good when you have maintainers claiming endlessly that something is broken and making arguments to authority but _not able to explain why_. The thread on the new executable memory allocator still needs something more concrete on the issues with speculative execution from Andy or someone else). Let me just lay out the non-bcachefs dependencies: - two lockdep patches: these could technically be dropped from the series, but that would mean dropping lockdep support entirely for btree node locks, and even Linus has said we need to get rid of lockdep_no_validate_class so I'm hoping to avoid that. - six locks: this shouldn't be blocking, we can move them to fs/bcachefs/ if Linus still feels they need more review - but Dave Chinner was wanting them and the locking people disliked exporting osq_lock so that's why I have them in kernel/locking. 
- mean_and_variance: this is some statistics library code that computes mean and standard deviation for time samples, both standard and exponentially weighted. Again, bcachefs is the first user so this pull request is the proper place for this code, and I'm intending to convert bcache to this code as well as use it for other kernel wide latency tracking (which I demoed at LSF a while back; I'll be posting it again once code tagging is upstreamed as part of the memory allocation work Suren and I are doing).

- task_struct->faults_disabled_mapping: this adds a task_struct member that makes it possible to do strict page cache coherency. This is something I intend to push into the VFS, but it's going to be a big project - it needs a new type of lock (the one in bcachefs is good enough for an initial implementation, but the real version probably needs priority inheritance and other things). In the meantime, I've thoroughly documented what's going on and what the plan is in the commit message.

- d_mark_tmpfile(): trivial new helper, from pulling out part of d_tmpfile(). We need this because bcachefs handles the nlink count for tmpfiles earlier, in the btree transaction.

- copy_folio_from_iter_atomic(): obvious new helper, other filesystems will want this at some point as part of the ongoing folio conversion

- block layer patches: we have

  - new exports: primarily because bcachefs has its own dio path and does not use iomap, also blk_status_to_str() for better error messages

  - bio_iov_iter_get_pages() with bio->bi_bdev unset: bcachefs builds up bios before we know which block device those bios will be issued to. There was something thrown out about "bi_bdev being required" - but that doesn't make a lot of sense here. The direction in the block layer back when I made it sane for stacking block drivers - i.e. enabling efficient splitting/cloning of bios - was towards bios being more just simple iterators over a scatter/gather list, and now we've got iov_iter which can point at a bio/bvec array - moving even more in that direction. Regardless, this patch is pretty trivial, it's not something that commits us to one particular approach. bio_iov_iter_get_pages() is here trying to return bios that are aligned to the block device's blocksize, but in our case we just want it aligned to the filesystem's blocksize.

  - bring back zero_fill_bio_iter() - I originally wrote this, Christoph deleted it without checking. It's just a more general version of zero_fill_bio().

  - Don't block on s_umount from __invalidate_super: this is a bugfix for a deadlock in generic/042 because of how we use sget(), the commit message goes into more detail. bcachefs doesn't use sget() for mutual exclusion because a) sget() is insane, what we really want is the _block device_ to be opened exclusively (which we do), and we have our own block device opening path - which we need to, as we're a multi device filesystem.

- generic radix tree fixes: this is just fixes for code I already wrote for bcachefs and upstreamed previously, after converting existing users of flex-array.

- move closures to lib/ - this is also code I wrote, now needs to be shared with bcache

- small stuff:
  - export stack_trace_save_stack() - this is used for displaying stack traces in debugfs
  - export errname() - better error messages
  - string_get_size() - change it to return number of characters written
  - compiler attributes - add __flatten

If there are objections to any of these patches, please _be specific_.
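[Aside: a toy userspace sketch of the kind of bookkeeping the mean_and_variance item above refers to - a running mean/standard deviation plus an exponentially weighted variant. Illustrative only; the in-kernel code works on integer time samples with fixed-point weights rather than doubles, and none of the names below come from the actual library.]

/* Toy illustration: running mean/stddev (Welford) and an EWMA variant.
 * Build with: cc ewma.c -lm */
#include <math.h>
#include <stdio.h>

struct mv {
	unsigned long	n;
	double		mean;
	double		m2;	/* sum of squared deviations (Welford) */
};

static void mv_update(struct mv *s, double x)
{
	double d = x - s->mean;

	s->n++;
	s->mean += d / s->n;
	s->m2 += d * (x - s->mean);
}

static double mv_stddev(const struct mv *s)
{
	return s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0;
}

struct mv_weighted {
	double	mean;
	double	var;
	double	alpha;	/* weight of each new sample, e.g. 1/16 */
};

static void mv_weighted_update(struct mv_weighted *s, double x)
{
	double d = x - s->mean;

	/* standard exponentially weighted mean/variance update */
	s->mean += s->alpha * d;
	s->var = (1 - s->alpha) * (s->var + s->alpha * d * d);
}

int main(void)
{
	struct mv io = { 0 };
	struct mv_weighted recent = { .alpha = 1.0 / 16 };
	double samples[] = { 100, 120, 95, 400, 110 };	/* e.g. latencies in us */

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		mv_update(&io, samples[i]);
		mv_weighted_update(&recent, samples[i]);
	}

	printf("mean %.1f stddev %.1f ewma %.1f\n",
	       io.mean, mv_stddev(&io), recent.mean);
	return 0;
}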
Please remember that I am also incorporating feedback from previous discussions, and a generic "these patches need to go in separately" is not something I can do anything with, as explained previously.

> 2) We already have recent examples of merge and disappear. Yes of course you've > been around for a long time, you aren't the NTFS developers. But as you point > out it's 90k of code. When btrfs was merged there were 3 large contributors, > Chris, myself, and Yanzheng. If Chris got hit by a bus we could still drive the > project forward. Can the same be said for bachefs? I know others have chimed > in and done some stuff, but as it's been stated elsewhere it would be good to > have somebody else in the MAINTAINERS file with you.

Yes, the bcachefs project needs to grow in terms of developers. The unfortunate reality is that right now is a _hard_ time to grow teams and budgets in this area; it's been an uphill battle.

You, the btrfs developers, got started when Linux filesystem teams were quite a bit bigger than they are now: I was at Google when Google had a bunch of people working on ext4, and that was when ZFS had recently come out and there was recognition that Linux needed an answer to ZFS and you were able to ride that excitement. It's been a bit harder for me to get something equally ambitious going, to be honest.

But years ago when I realized I was onto something, I decided this project was only going to fail if I let it fail - so I'm in it for the long haul.

Right now what I'm hearing, in particular from Redhat, is that they want it upstream in order to commit more resources. Which, I know, is not what kernel people want to hear, but it's the chicken-and-the-egg situation I'm in.

> I am really, really wanting you to succeed here Kent. If the general consensus > is you need to have some idiot review fs/bcachefs I will happily carve out some > time and dig in.

That would be much appreciated - I'll owe you some beers next time I see you. But before jumping in, let's see if we can get people who have already worked with the code to say something.

Something I've done in the past that might be helpful - instead (or in addition to) having people read code in isolation, perhaps we could do a group call/meeting where people can ask questions about the code, bring up design issues they've seen in other filesystems, etc. - I've also found that kind of setup great for identifying places in the code where additional documentation would be useful.
On 7/6/23 12:38 PM, Kent Overstreet wrote: > Right now what I'm hearing, in particular from Redhat, is that they want > it upstream in order to commit more resources. Which, I know, is not > what kernel people want to hear, but it's the chicken-and-the-egg > situation I'm in. I need to temper that a little. Folks in and around filesystems and storage at Red Hat find bcachefs to be quite compelling and interesting, and we've spent some resources in the past several months to investigate, test, benchmark, and even do some bugfixing. Upstream acceptance is going to be a necessary condition for almost any distro to consider shipping or investing significantly in bcachefs. But it's not a given that once it's upstream we'll immediately commit more resources - I just wanted to clarify that. It is a tough chicken and egg problem to be sure. That said, I think you're right Kent - landing it upstream will quite likely encourage more interest, users, and hopefully developers. Maybe it'd be reasonable to mark bcachefs as EXPERIMENTAL or similar in Kconfig, documentation, and printks - it'd give us options in case it doesn't attract developers and Kent does get hit by a bus or decide to go start a goat farm instead (i.e. in the worst case, it could be yanked, having set expectations.) -Eric
On Thu, Jul 06, 2023 at 02:17:29PM -0500, Eric Sandeen wrote: > On 7/6/23 12:38 PM, Kent Overstreet wrote: > > Right now what I'm hearing, in particular from Redhat, is that they want > > it upstream in order to commit more resources. Which, I know, is not > > what kernel people want to hear, but it's the chicken-and-the-egg > > situation I'm in. > > I need to temper that a little. Folks in and around filesystems and storage > at Red Hat find bcachefs to be quite compelling and interesting, and we've > spent some resources in the past several months to investigate, test, > benchmark, and even do some bugfixing. > > Upstream acceptance is going to be a necessary condition for almost any > distro to consider shipping or investing significantly in bcachefs. But it's > not a given that once it's upstream we'll immediately commit more resources > - I just wanted to clarify that.

Yeah, I should probably have worded that a bit better. But in the conversations I've had with people at other companies it does sound like the interest is there, it's just that filesystem/storage teams are not so big these days as to support investing in something that is not yet mainlined.

> It is a tough chicken and egg problem to be sure. That said, I think you're > right Kent - landing it upstream will quite likely encourage more interest, > users, and hopefully developers.

Gotta start somewhere :)

> Maybe it'd be reasonable to mark bcachefs as EXPERIMENTAL or similar in > Kconfig, documentation, and printks - it'd give us options in case it > doesn't attract developers and Kent does get hit by a bus or decide to go > start a goat farm instead (i.e. in the worst case, it could be yanked, > having set expectations.)

Yeah, it does need to be marked EXPERIMENTAL initially, regardless - staged rollout please, not everyone all at once :)
On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote: > On 6/28/23 2:44?PM, Jens Axboe wrote: > > On 6/28/23 11:52?AM, Kent Overstreet wrote: > >> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote: > >>> I discussed this with Christian offline. I have a patch that is pretty > >>> simple, but it does mean that you'd wait for delayed fput flush off > >>> umount. Which seems kind of iffy. > >>> > >>> I think we need to back up a bit and consider if the kill && umount > >>> really is sane. If you kill a task that has open files, then any fput > >>> from that task will end up being delayed. This means that the umount may > >>> very well fail. > >>> > >>> It'd be handy if we could have umount wait for that to finish, but I'm > >>> not at all confident this is a sane solution for all cases. And as > >>> discussed, we have no way to even identify which files we'd need to > >>> flush out of the delayed list. > >>> > >>> Maybe the test case just needs fixing? Christian suggested lazy/detach > >>> umount and wait for sb release. There's an fsnotify hook for that, > >>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems > >>> to me that this would be the way to make it more reliable when killing > >>> of tasks with open files are involved. > >> > >> No, this is a real breakage. Any time we introduce unexpected > >> asynchrony there's the potential for breakage: case in point, there was > >> a filesystem that made rm asynchronous, then there were scripts out > >> there that deleted until df showed under some threshold.. whoops... > > > > This is nothing new - any fput done from an exiting task will end up > > being deferred. The window may be a bit wider now or a bit different, > > but it's the same window. If an application assumes it can kill && wait > > on a task and be guaranteed that the files are released as soon as wait > > returns, it is mistaken. That is NOT the case. > > Case in point, just changed my reproducer to use aio instead of > io_uring. Here's the full script: > > #!/bin/bash > > DEV=/dev/nvme1n1 > MNT=/data > ITER=0 > > while true; do > echo loop $ITER > sudo mount $DEV $MNT > fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --output=/dev/null & > Y=$(($RANDOM % 3)) > X=$(($RANDOM % 10)) > VAL="$Y.$X" > sleep $VAL > ps -e | grep fio > /dev/null 2>&1 > while [ $? -eq 0 ]; do > killall -9 fio > /dev/null 2>&1 > echo will wait > wait > /dev/null 2>&1 > echo done waiting > ps -e | grep "fio " > /dev/null 2>&1 > done > sudo umount /data > if [ $? -ne 0 ]; then > break > fi > ((ITER++)) > done > > and if I run that, fails on the first umount attempt in that loop: > > axboe@m1max-kvm ~> bash test2.sh > loop 0 > will wait > done waiting > umount: /data: target is busy. Your test fails because fio by default spawns off multiple processes, and just calling wait does not wait for the subprocesses. When I pass --thread to fio, your test passes. I have a patch to avoid use of the delayed_fput list in the aio path, but curiously it seems not to be needed - perhaps there's some other synchronization I haven't found yet. 
I'm including the patch below in case the technique is useful for io_uring: diff --git a/fs/aio.c b/fs/aio.c index b3e14a9fe3..00cb953efa 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -211,6 +211,7 @@ struct aio_kiocb { * for cancellation */ refcount_t ki_refcnt; + struct task_struct *ki_task; /* * If the aio_resfd field of the userspace iocb is not zero, * this is the underlying eventfd context to deliver events to. @@ -321,7 +322,7 @@ static void put_aio_ring_file(struct kioctx *ctx) ctx->aio_ring_file = NULL; spin_unlock(&i_mapping->private_lock); - fput(aio_ring_file); + __fput_sync(aio_ring_file); } } @@ -1068,6 +1069,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) INIT_LIST_HEAD(&req->ki_list); refcount_set(&req->ki_refcnt, 2); req->ki_eventfd = NULL; + req->ki_task = get_task_struct(current); return req; } @@ -1104,8 +1106,9 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) if (iocb->ki_eventfd) eventfd_ctx_put(iocb->ki_eventfd); if (iocb->ki_filp) - fput(iocb->ki_filp); + fput_for_task(iocb->ki_filp, iocb->ki_task); percpu_ref_put(&iocb->ki_ctx->reqs); + put_task_struct(iocb->ki_task); kmem_cache_free(kiocb_cachep, iocb); } diff --git a/fs/file_table.c b/fs/file_table.c index 372653b926..137f87f55e 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -367,12 +367,13 @@ EXPORT_SYMBOL_GPL(flush_delayed_fput); static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_for_task(struct file *file, struct task_struct *task) { if (atomic_long_dec_and_test(&file->f_count)) { - struct task_struct *task = current; + if (!task && likely(!in_interrupt() && !(current->flags & PF_KTHREAD))) + task = current; - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { + if (task) { init_task_work(&file->f_rcuhead, ____fput); if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) return; @@ -388,6 +389,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_for_task(file, NULL); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without @@ -405,6 +411,7 @@ void __fput_sync(struct file *file) } } +EXPORT_SYMBOL(fput_for_task); EXPORT_SYMBOL(fput); EXPORT_SYMBOL(__fput_sync); diff --git a/include/linux/file.h b/include/linux/file.h index 39704eae83..667a68f477 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -12,7 +12,9 @@ #include <linux/errno.h> struct file; +struct task_struct; +extern void fput_for_task(struct file *, struct task_struct *); extern void fput(struct file *); struct file_operations;
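[Aside: a minimal userspace sketch of the kill && wait && umount pattern being argued about above. The mount point and file name are made up, and in the reported failures the lingering file references come from in-flight aio/io_uring requests rather than a plain open(), so this only shows the shape of the race - it is not a guaranteed reproducer.]

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* child: hold a file open on the mount, then spin */
		int fd = open("/mnt/test/foo", O_RDWR | O_CREAT, 0644);
		if (fd < 0) {
			perror("open");
			_exit(1);
		}
		for (;;)
			pause();
	}

	sleep(1);			/* let the child get the file open */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);		/* child is gone... */

	if (umount2("/mnt/test", 0))	/* ...but has its last fput finished? */
		perror("umount2");	/* EBUSY here is the failure mode */
	else
		puts("umount ok");
	return 0;
}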
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote: > On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote: > > I've been watching this from the sidelines sort of busy with other things, but I > > realize that comments I made at LSFMMBPF have been sort of taken as the gospel > > truth and I want to clear some of that up. > > > > I said this at LSFMMBPF, and I haven't said it on list before so I'll repeat it > > here. > > > > I'm of the opinion that me and any other outsider reviewing the bcachefs code in > > bulk is largely useless. I could probably do things like check for locking > > stuff and other generic things. > > Yeah, agreed. And the generic things - that's what we've got automated > testing for; there's a reason I've been putting so much effort into > automated testing over (especially) the past year. Woot. That's more than I can say for ntfs3... > > You have patches that are outside of fs/bcachefs. Get those merged and then do > > a pull with just fs/bcachefs, because again posting 90k loc is going to be > > unwieldy and the quality of review just simply will not make a difference. > > > > Alternatively rework your code to not have any dependencies outside of > > fs/bcachefs. This is what btrfs did. That merge didn't touch anything outside > > of fs/btrfs. > > We've had other people saying, at multiple times in the past, that > patches that are only needed for bcachefs should be part of the initial > pull instead of going in separately. > > I've already cut down the non-bcachefs pull quite a bit, even to the > point of making non-ideal engineering choices, and if I have to cut it > down more it's going to mean more ugly choices. <nod> > > This merge attempt has gone off the rails, for what appears to be a few common > > things. > > > > 1) The external dependencies. There's a reason I was really specific about what > > I said at LSFMMBPF, both this year and in 2022. Get these patches merged first, > > the rest will be easier. You are burning a lot of good will being combative > > with people over these dependencies. This is not the hill to die on. You want > > bcachefs in the kernel and to get back to bcachefs things. Make the changes you > > need to make to get these dependencies in, or simply drop the need for them and > > come back to it later after bcachefs is merged. > > Look, I'm not at all trying to be combative, I'm just trying to push > things forward. > > The one trainwreck-y thread was regarding vmalloc_exec(), and posting > that patch needed to happen in order to figure out what was even going > to happen regarding the dynamic codegen going forward. It's been dropped > from the initial pull, and dynamic codegen is going to wait on a better > executable memory allocation API. > > (and yes, that thread _was_ a trainwreck; it's not good when you have > maintainers claiming endlessly that something is broken and making > arguments to authority but _not able to explain why_. The thread on the > new executable memory allocator still needs something more concrete on > the issues with speculative execution from Andy or someone else). (Honestly I'm glad that's set aside for now, because it seemed like a giant can of worms for a non-critical optimization.) 
> Let me just lay out the non-bcachefs dependencies: > > - two lockdep patches: these could technically be dropped from the > series, but that would mean dropping lockdep support entirely for > btree node locks, and even Linus has said we need to get rid of > lockdep_no_validate_class so I'm hoping to avoid that. > > - six locks: this shouldn't be blocking, we can move them to > fs/bcachefs/ if Linus still feels they need more review - but Dave > Chinner was wanting them and the locking people disliked exporting > osq_lock so that's why I have them in kernel/locking. That's probably ok for now; we don't (AFAIK) have any concrete plan for deploying sixlocks in xfs at this time. Even if we did, it's still going to take another 15 years to review the ~2000 patches of backlog in djwong/dchinner trees. > - mean_and_variance: this is some statistics library code that computes > mean and standard deviation for time samples, both standard and > exponentially weighted. Again, bcachefs is the first user so this > pull request is the proper place for this code, and I'm intending to > convert bcache to this code as well as use it for other kernel wide > latency tracking (which I demod at LSF awhile back; I'll be posting > it again once code tagging is upstreamed as part of the memory > allocation work Suren and I are doing). TBH, so long as bcachefs is the only user of sixlocks and mean/variance, I don't really care what's in them, though they probably ought to live in fs/bcachefs/ until a second user arises. > - task_struct->faults_disabled_mapping: this adds a task_struct member > that makes it possible to do strict page cache coherency. > > This is something I intend to push into the VFS, but it's going to be > a big project - it needs a new type of lock (the one in bcachefs is > good enough for an initial implementation, but the real version > probably needs priority inheritence and other things). In the > meantime, I've thoroughly documented what's going on and what the > plan is in the commit message. > > - d_mark_tmpfile(): trivial new helper, from pulling out part of > d_tmpfile(). We need this because bcachefs handles the nlink count > for tmpfiles earlier, in the btree transaction. XFS might want this too, we also handle the nlink count for tmpfiles earlier, in a transaction, and end up playing stupid games with the refcount to fit the vfs function: if (tmpfile) { /* * The VFS requires that any inode fed to d_tmpfile must * have nlink == 1 so that it can decrement the nlink in * d_tmpfile. However, we created the temp file with * nlink == 0 because we're not allowed to put an inode * with nlink > 0 on the unlinked list. Therefore we * have to set nlink to 1 so that d_tmpfile can * immediately set it back to zero. */ set_nlink(inode, 1); d_tmpfile(tmpfile, inode); } > - copy_folio_from_iter_atomic(): obvious new helper, other filesystems > will want this at some point as part of the ongoing folio conversion > > - block layer patches: we have > > - new exports: primarily because bcachefs has its own dio path and > does not use iomap, also blk_status_to_str() for better error > messages > > - bio_iov_iter_get_pages() with bio->bi_bdev unset: bcachefs builds > up bios before we know which block device those bios will be > issued to. > > There was something thrown out about "bi_bdev being required" - but > that doesn't make a lot of sense here. The direction in the block > layer back when I made it sane for stacking block drivers - i.e. 
> enabling efficient splitting/cloning of bios - was towards bios > being more just simple iterators over a scatter/gather list, and > now we've got iov_iter which can point at a bio/bvec array - moving > even more in that direction. > > Regardless, this patch is pretty trivial, it's not something that > commits us to one particular approach. bio_iov_iter_get_pages() is > here trying to return bios that are aligned to the block device's > blocksize, but in our case we just want it aligned to the > filesystem's blocksize. <shrug> seems fine to me... > - bring back zero_fill_bio_iter() - I originally wrote this, > Christoph deleted it without checking. It's just a more general > version of zero_fill_bio(). > > - Don't block on s_umount from __invalidate_super: this is a bugfix > for a deadlock in generic/042 because of how we use sget(), the > commit message goes into more detail. If this is in reference to the earlier subthread about some io_uring thing causing unmount to hang -- my impressions of that were that yes, it's a bug, but no, it's not a bug in bcachefs itself. I also wondered why (a) that hadn't split out as its own thread; and (b) is this really a bcachefs blocker? /me shrugs, been on vacation and in hospitals for the last month or so. > bcachefs doesn't use sget() for mutual exclusion because a) sget() > is insane, what we really want is the _block device_ to be opened > exclusively (which we do), and we have our own block device opening > path - which we need to, as we're a multi device filesystem. ...and isn't jan kara already messing with this anyway? > - generic radix tree fixes: this is just fixes for code I already wrote > for bcachefs and upstreamed previously, after converting existing > users of flex-array. > > - move closures to lib/ - this is also code I wrote, now needs to be > shared with bcache <nod> > - small stuff: > - export stack_trace_save_stack() - this is used for displaying stack > traces in debugfs > - export errname() - better error messages > - string_get_size() - change it to return number of characters written > - compiler attributes - add __flatten > > If there are objections to any of these patches, please _be specific_. > Please remember that I am also incorporating feedback previous > discussions, and a generic "these patches need to go in separately" is > not something I can do anything with, as explained previously. > > > 2) We already have recent examples of merge and disappear. Yes of course you've > > been around for a long time, you aren't the NTFS developers. But as you point > > out it's 90k of code. When btrfs was merged there were 3 large contributors, > > Chris, myself, and Yanzheng. If Chris got hit by a bus we could still drive the > > project forward. Can the same be said for bachefs? The same can't even be said about ext4 or xfs -- if Ted, Dave, or I disappeared tomorrow, I predict there would be huge problems within a month or two. I'm of two minds here -- I want to say that bcachefs should get merged because wasting Kent's mind on rebasing out of tree patchsets is totally stupid and I think I've worked over his QA/CI system enough to trust that bcachefs isn't a giant nightmare codebase. OTOH there's so many layers of tiny helper functions and macros that I have a really hard time making sense of all those pre-bcachefs changes. That's why I haven't weighed in. Given all the weird problems we've had recently with new code colliding badly with under-documented optimized core code, I'm fearful of touching anything. 
> > I know others have chimed > > in and done some stuff, but as it's been stated elsewhere it would be good to > > have somebody else in the MAINTAINERS file with you. > > Yes, the bcachefs project needs to grow in terms of developers. The > unfortunate reality is that right now is a _hard_ time to growing teams > and budgets in this area; it's been an uphill battle. Same here. Sillycon Valley just laid off what, like 300,000 engineers so they could refocus on "AI" but they can't pay for 30 more people to work on filesystems??</rant> > You, the btrfs developers, got started when Linux filesystem teams were > quite a bit bigger than they are now: I was at Google when Google had a > bunch of people working on ext4, and that was when ZFS had recently come > out and there was recognition that Linux needed an answer to ZFS and you > were able to ride that excitement. It's been a bit harder for me to get > something equally ambitions going, to be honest. > > But years ago when I realized I was onto something, I decided this > project was only going to fail if I let it fail - so I'm in it for the > long haul. > > Right now what I'm hearing, in particular from Redhat, is that they want > it upstream in order to commit more resources. Which, I know, is not > what kernel people want to hear, but it's the chicken-and-the-egg > situation I'm in. /me suspects mainline merging is necessary but not sufficient -- few non-developers want to deal with merging an out of tree filesystem, but that still doesn't mean anyone will commit real engineering resources. > > I am really, really wanting you to succeed here Kent. If the general consensus > > is you need to have some idiot review fs/bcachefs I will happily carve out some > > time and dig in. > > That would be much appreciated - I'll owe you some beers next time I see > you. But before jumping in, let's see if we can get people who have > already worked with the code to say something. > > Something I've done in the past that might be helpful - instead (or in > addition to) having people read code in isolation, perhaps we could do a > group call/meeting where people can ask questions about the code, bring > up design issues they've seen in other filesystems, etc. - I've also > found that kind of setup great for identifying places in the code where > additional documentation would be useful. "At this point I think Kent's QA efforts are at least as good as XFS's, just merge it and let's move on to the next thing." --D
On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote: > TBH, so long as bcachefs is the only user of sixlocks and mean/variance, > I don't really care what's in them, though they probably ought to live > in fs/bcachefs/ until a second user arises. I've been waiting for Linus to weigh in on those (and the rest of the merge) since he had opinions a few weeks ago, but I have no real objection there. I'd need to add an export for osq_lock, that's all. > > - d_mark_tmpfile(): trivial new helper, from pulling out part of > > d_tmpfile(). We need this because bcachefs handles the nlink count > > for tmpfiles earlier, in the btree transaction. > > XFS might want this too, we also handle the nlink count for tmpfiles > earlier, in a transaction, and end up playing stupid games with the > refcount to fit the vfs function: > > if (tmpfile) { > /* > * The VFS requires that any inode fed to d_tmpfile must > * have nlink == 1 so that it can decrement the nlink in > * d_tmpfile. However, we created the temp file with > * nlink == 0 because we're not allowed to put an inode > * with nlink > 0 on the unlinked list. Therefore we > * have to set nlink to 1 so that d_tmpfile can > * immediately set it back to zero. > */ > set_nlink(inode, 1); > d_tmpfile(tmpfile, inode); > } Yeah, that same game would technically work for bcachefs - but I'm hoping we can just do the right thing here :) > > - Don't block on s_umount from __invalidate_super: this is a bugfix > > for a deadlock in generic/042 because of how we use sget(), the > > commit message goes into more detail. > > If this is in reference to the earlier subthread about some io_uring > thing causing unmount to hang -- my impressions of that were that yes, > it's a bug, but no, it's not a bug in bcachefs itself. I also wondered > why (a) that hadn't split out as its own thread; and (b) is this really > a bcachefs blocker? No, this is completely unrelated. The io_uring thing hits on generic/388 (and others) and just causes umount to fail with -EBUSY. This one is an actual deadlock and it hits every time in generic/042. It's specific to the loopback device and when it emits certain events, and it hits every time so I really do need this fix included. > /me shrugs, been on vacation and in hospitals for the last month or so. > > > bcachefs doesn't use sget() for mutual exclusion because a) sget() > > is insane, what we really want is the _block device_ to be opened > > exclusively (which we do), and we have our own block device opening > > path - which we need to, as we're a multi device filesystem. > > ...and isn't jan kara already messing with this anyway? The blkdev_get_handle() patchset? I like that, but I don't think that's related - if there's something more related to sget() I haven't seen it yet > OTOH there's so many layers of tiny helper functions and macros that I > have a really hard time making sense of all those pre-bcachefs changes. > That's why I haven't weighed in. Given all the weird problems we've had > recently with new code colliding badly with under-documented optimized > core code, I'm fearful of touching anything. ??? not sure what you're referring to here, are there specific patches or recent issues you're thinking of? I don't think any of the non-fs/bcachefs/ patches are remotely tricky enough for any of this to be a concern. 
> > You, the btrfs developers, got started when Linux filesystem teams were > > quite a bit bigger than they are now: I was at Google when Google had a > > bunch of people working on ext4, and that was when ZFS had recently come > > out and there was recognition that Linux needed an answer to ZFS and you > > were able to ride that excitement. It's been a bit harder for me to get > > something equally ambitions going, to be honest. > > > > But years ago when I realized I was onto something, I decided this > > project was only going to fail if I let it fail - so I'm in it for the > > long haul. > > > > Right now what I'm hearing, in particular from Redhat, is that they want > > it upstream in order to commit more resources. Which, I know, is not > > what kernel people want to hear, but it's the chicken-and-the-egg > > situation I'm in. > > /me suspects mainline merging is necessary but not sufficient -- few > non-developers want to deal with merging an out of tree filesystem, but > that still doesn't mean anyone will commit real engineering resources. Yeah, no doubt it will continue to be an uphill battle. But it's a necessary step in the right direction, for sure. > > > I am really, really wanting you to succeed here Kent. If the general consensus > > > is you need to have some idiot review fs/bcachefs I will happily carve out some > > > time and dig in. > > > > That would be much appreciated - I'll owe you some beers next time I see > > you. But before jumping in, let's see if we can get people who have > > already worked with the code to say something. > > > > Something I've done in the past that might be helpful - instead (or in > > addition to) having people read code in isolation, perhaps we could do a > > group call/meeting where people can ask questions about the code, bring > > up design issues they've seen in other filesystems, etc. - I've also > > found that kind of setup great for identifying places in the code where > > additional documentation would be useful. > > "At this point I think Kent's QA efforts are at least as good as XFS's, > just merge it and let's move on to the next thing." high praise :)
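[Aside: a hypothetical before/after for the tmpfile link-count dance discussed in the messages above, assuming d_mark_tmpfile() keeps d_tmpfile()'s (file, inode) arguments and differs only in not touching nlink, leaving instantiation to the caller. A sketch of the idea, not the actual helper or any real caller.]

/*
 * Hypothetical sketch only.
 *
 * Today: a filesystem that already accounted nlink in its own transaction
 * has to bump it back to 1 just so d_tmpfile() can decrement it again:
 */
set_nlink(inode, 1);
d_tmpfile(file, inode);

/*
 * With the split-out helper, the same filesystem could skip the dance:
 * mark the dentry as a tmpfile and instantiate it, leaving nlink == 0
 * exactly as the transaction left it.
 */
d_mark_tmpfile(file, inode);
d_instantiate(file->f_path.dentry, inode);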
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote: > You, the btrfs developers, got started when Linux filesystem teams were > quite a bit bigger than they are now: I was at Google when Google had a > bunch of people working on ext4, and that was when ZFS had recently come > out and there was recognition that Linux needed an answer to ZFS and you > were able to ride that excitement. It's been a bit harder for me to get > something equally ambitions going, to be honest.

Just to set the historical record straight, I think you're mixing up two stories here. *Btrfs* was started while I was at the IBM Linux Technology Center, and it was because there were folks from more than one company that were concerned that there needed to be an answer to ZFS. IBM hosted that meeting, but ultimately, never did contribute any developers to the btrfs effort.

That's because IBM had a fairly cold, hard examination of what their enterprise customers really wanted, and would be willing to pay $$$ for, and the decision was made at a corporate level (higher up than the Linux Technology Center, although I participated in the company-wide investigation) that *none* of the OS's that IBM supported (AIX, zOS, Linux, etc.) needed ZFS-like features, because IBM's customers didn't need them. The vast majority of paying customers' workloads at the time were running things like Websphere, and Oracle and DB/2, and these did not need fancy snapshots. And things like integrity could be provided at other layers of the storage stack.

As far as Google was concerned, yes, we had several software engineers working on ext4, but it had nothing to do with ZFS. We had a solid business case for how replacing ext2 with ext4 (in nojournal mode, since the cluster file system handled data integrity and crash recovery) would save the company $XXX millions of dollars in storage TCO (total cost of ownership) per year.

In any case, at neither company was a "sense of excitement" something which drove the technical decisions. It was all about Return on Investment (ROI).

As such, that's driven my bias towards ext4 maintenance. I view part of my job as finding matches between interesting file system features that I would find technically interesting, and which would benefit the general ext4 user base, and specific business cases that would encourage the investment of several developers on file system technologies. Things like case insensitive file names, fscrypt, fsverity, etc., were all started *after* I had found a business case that would interest one or more companies or divisions inside Google to put people on the project.

Smaller projects can get funded on the margins, sure. But for anything big, that might require the focused attention of one or more developers for a quarter or more, I generally find the business case first, and often, that will inform the requirements for the feature. In other words, not only am I ext4's maintainer, I'm also its product manager.

Of course, this is not the only way you can drive technology forward. For example, at Sun Microsystems, ZFS was driven just by the techies, and initially, they hid the fact that the project was taking place, not asking the opinion of the finance and sales teams. And so ZFS had quite a lot of very innovative technologies that pushed the industry forward, including inspiring btrfs. Of course, Sun Microsystems didn't do all that well financially, until they were forced to sell themselves to the highest bidder.
So perhaps, it might be that this particular model is one that other companies, including IBM, Red Hat, Microsoft, Oracle, Facebook, etc., might choose to avoid emulating. Cheers, - Ted
> just merge it and let's move on to the next thing."
"and let the block and vfs maintainers and developers deal with the fallout"
is how that reads to others that deal with 65+ filesystems and counting.
The offlist thread that was started by Kent before this pr was sent has
seen people try to outline calmly what problems they currently still
have both maintenance wise and upstreaming wise. And it seems there's
just no way this can go over calmly but instead requires massive amounts
of defensive pushback and grandstanding.
Our main task here is to consider the concerns of people that constantly
review and rework massive amounts of generic code. And I can't in good
conscience see their concerns dismissed with snappy quotes.
I understand the impatience, I understand the excitement, I really do.
But not in this way where core people just drop off because they don't
want to deal with this anymore.
I've spent enough time on this thread.
On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote: > > just merge it and let's move on to the next thing." > > "and let the block and vfs maintainers and developers deal with the fallout" > > is how that reads to others that deal with 65+ filesystems and counting. > > The offlist thread that was started by Kent before this pr was sent has > seen people try to outline calmly what problems they currently still > have both maintenance wise and upstreaming wise. And it seems there's > just no way this can go over calmly but instead requires massive amounts > of defensive pushback and grandstanding. > > Our main task here is to consider the concerns of people that constantly > review and rework massive amounts of generic code. And I can't in good > conscience see their concerns dismissed with snappy quotes. > > I understand the impatience, I understand the excitement, I really do. > But not in this way where core people just drop off because they don't > want to deal with this anymore. > > I've spent enough time on this thread.

Christian, the hostility I'm reading is in your steady passive-aggressive accusations, and your patronizing attitude. It's not professional, and it's not called for. Can we please try to stay focused on the code, and the process, and the _actual_ concerns?

In that offlist thread, I don't recall much in the way of actual, concrete concerns. I do recall Christoph doing his usual schpiel; and to be clear, I cut short my interactions with Christoph because in nearly 15 years of kernel development he's never been anything but hostile to anything I've posted, and the criticisms he posts tend to be vague and unaware of the surrounding discussion, not anything actionable.

The most concrete concern from you in that offlist thread was "we don't want a repeat of ntfs", and when I asked you to elaborate you never responded. Huh.

And: this pull request is not some sudden thing, I have been steadily feeding prep work patches in and having ongoing discussions with other filesystem people, including presenting at LSF to gather feedback, since _well_ before you were the VFS maintainer.

If you have anything concrete to share, any concrete concerns you'd like addressed - please share them! I'd love to work with you. I don't want the two of us to have a hostile, adversarial relationship; I appreciate the work you've been doing in the vfs, and I've told you that in the past. But it would help things if you would try to work with me, not against me, and try to understand that there's been past discussions and consensus that was built before you came along.

Cheers, Kent
On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote: > > just merge it and let's move on to the next thing." > > "and let the block and vfs maintainers and developers deal with the fallout" > > is how that reads to others that deal with 65+ filesystems and counting. > > The offlist thread that was started by Kent before this pr was sent has > seen people try to outline calmly what problems they currently still > have both maintenance wise and upstreaming wise. And it seems there's > just no way this can go over calmly but instead requires massive amounts > of defensive pushback and grandstanding. > > Our main task here is to consider the concerns of people that constantly > review and rework massive amounts of generic code. And I can't in good > conscience see their concerns dismissed with snappy quotes. > > I understand the impatience, I understand the excitement, I really do. > But not in this way where core people just drop off because they don't > want to deal with this anymore. > > I've spent enough time on this thread.

Also, if you do feel like coming back to the discussion: I would still like to hear in more detail about your specific pain points and talk about what we can do to address them.

I've put a _ton_ of work into test infrastructure over the years, and it's now scalable enough to handle fstests runs on every filesystem fstests supports - and it'll get you the results in short order. I've started making the cluster available to other devs, and I'd be happy to make it available to you as well.

Perhaps there's other things we could do.

Cheers, Kent
On Thu, Jul 06, 2023 at 01:38:19PM -0400, Kent Overstreet wrote: > On Thu, Jul 06, 2023 at 12:40:55PM -0400, Josef Bacik wrote: ... > > I am really, really wanting you to succeed here Kent. If the general consensus > > is you need to have some idiot review fs/bcachefs I will happily carve out some > > time and dig in. > > That would be much appreciated - I'll owe you some beers next time I see > you. But before jumping in, let's see if we can get people who have > already worked with the code to say something. > I've been poking at bcachefs for several months or so now. I'm happy to chime in on my practical experience thus far, though I'm still not totally clear what folks are looking for on this front, in terms of actual review. I agree with Josef's sentiment that a thorough code review of the entire fs is not really practical. I've not done that and don't plan to in the short term. As it is, I have been able to dig into various areas of the code, learn some of the basic principles, diagnose/fix issues and get some of those fixes merged without too much trouble. IMO, the code is fairly well organized at a high level, reasonably well documented and debuggable/supportable. That isn't to say some of those things couldn't be improved (and I expect they will be), but these are more time and resource constraints than anything and so I don't see any major red flags in that regard. Some of my bigger personal gripes would be a lot of macro code generation stuff makes it a bit harder (but not impossible) for a novice to come up to speed, and similarly a bit more introductory/feature level documentation would be useful to help navigate areas of code without having to rely on Kent as much. The documentation that is available is still pretty good for gaining a high level understanding of the fs data structures, though I agree that more content on things like on-disk format would be really nice. Functionality wise I think it's inevitable that there will be some growing pains as user and developer base grows. For that reason I think having some kind of experimental status for a period of time is probably the right approach. Most of the issues I've dug into personally have been corner case type things, but experience shows that these are the sorts of things that eventually arise with more users. We've also briefly discussed things like whether bcachefs could take more advantage of some of the test coverage that btrfs already has in fstests, since the feature sets should largely overlap. That is TBD, but is something else that might be a good step towards further proving out reliability. Related to that, something I'm not sure I've seen described anywhere is the functional/production status of the filesystem itself (not necessarily the development status of the various features). For example, is the filesystem used in production at any level? If so, what kinds of deployments, workloads and use cases do you know about? How long have they been in use, etc.? I realize we may not have knowledge or permission to share details, but any general info about usage in the wild would be interesting. The development process is fairly ad hoc, so I suspect that is something that would have to evolve if this lands upstream. Kent, did you have thoughts/plans around that? I don't mind contributing reviews where I can, but that means patches would be posted somewhere for feedback, etc. 
I suppose that has potential to slow things down, but also gives people a chance to see what's happening, review or ask questions, etc., which is another good way to learn or simply keep up with things. All in all I pretty much agree with Josef wrt to the merge request. ISTM the main issues right now are the external dependencies and development/community situation (i.e. bus factor). As above, I plan to continue contributions at least in terms of fixes and whatnot so long as $employer continues to allow me to dedicate at least some time to it and the community is functional ;), but it's not clear to me if that is sufficient to address the concerns here. WRT the dependencies, I agree it makes sense to be deliberate and for anything that is contentious, either just drop it or lift it into bcachefs for now to avoid the need to debate on these various fronts in the first place (and simplify the pull request as much as possible). With those issues addressed, perhaps it would be helpful if other interested fs maintainers/devs could chime in with any thoughts on what they'd want to see in order to ack (but not necessarily "review") a new filesystem pull request..? I don't have the context of the off list thread, but from this thread ISTM that perhaps Josef and Darrick are close to being "soft" acks provided the external dependencies are worked out. Christoph sent a nak based on maintainer status. Kent, you can add me as a reviewer if 1. you think that will help and 2. if you plan to commit to some sort of more formalized development process that will facilitate review..? I don't know if that means an ack from Christoph, but perhaps it addresses the nak. I don't really expect anybody to review the entire codebase, but obviously it's available for anybody who might want to dig into certain areas in more detail.. Brian
On Thu 06-07-23 18:43:14, Kent Overstreet wrote: > On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote: > > /me shrugs, been on vacation and in hospitals for the last month or so. > > > > > bcachefs doesn't use sget() for mutual exclusion because a) sget() > > > is insane, what we really want is the _block device_ to be opened > > > exclusively (which we do), and we have our own block device opening > > > path - which we need to, as we're a multi device filesystem. > > > > ...and isn't jan kara already messing with this anyway? > > The blkdev_get_handle() patchset? I like that, but I don't think that's > related - if there's something more related to sget() I haven't seen it > yet There's a series on top of that that also modifies how sget() works [1]. Christian wants that bit to be merged separately from the bdev handle stuff and Christoph chimed in with some other related cleanups so he'll now take care of that change. Anyhow we should have sget() that does not exclusively claim the bdev unless it needs to create a new superblock soon. Honza [1] https://lore.kernel.org/all/20230704125702.23180-6-jack@suse.cz
On Fri, Jul 07, 2023 at 03:13:06PM +0200, Jan Kara wrote: > On Thu 06-07-23 18:43:14, Kent Overstreet wrote: > > On Thu, Jul 06, 2023 at 02:19:14PM -0700, Darrick J. Wong wrote: > > > /me shrugs, been on vacation and in hospitals for the last month or so. > > > > > > > bcachefs doesn't use sget() for mutual exclusion because a) sget() > > > > is insane, what we really want is the _block device_ to be opened > > > > exclusively (which we do), and we have our own block device opening > > > > path - which we need to, as we're a multi device filesystem. > > > > > > ...and isn't jan kara already messing with this anyway? > > > > The blkdev_get_handle() patchset? I like that, but I don't think that's > > related - if there's something more related to sget() I haven't seen it > > yet > > There's a series on top of that that also modifies how sget() works [1]. > Christian wants that bit to be merged separately from the bdev handle stuff > and Christoph chimed in with some other related cleanups so he'll now take > care of that change. > > Anyhow we should have sget() that does not exclusively claim the bdev > unless it needs to create a new superblock soon. Thanks for the link sget() felt a bit odd in bcachefs because we have our own bch2_fs_open() path that, completely separately from the VFS opens a list of block devices and returns a fully constructed filesystem handle. We need this because it's also used in userspace, where we don't have the VFS and it wouldn't make much sense to lift sget(), for e.g. fsck and other tools. IOW, we really do need to own the whole codepath that opens the actual block devices; our block device open path does things like parse the opts struct to decide whether to open the block device in write mode or exclusive mode... So the way around this in bcachefs is we call sget() twice - first in "find an existing sb but don't create one" mode, then if that fails we call bch2_fs_open() and call sget() again in "create a super_block and attach it to this bch_fs" - a bit awkward but it works. Not sure if this has come up in other filesystems, but here's the relevant bcachefs code: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs.c#n1756
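[Aside: a rough sketch of the two-pass sget() pattern Kent describes above. All example_* names are hypothetical - this is not the fs/bcachefs/fs.c code, just its shape: pass one uses a set callback that refuses to create a superblock, so sget() can only return an existing one; pass two runs after the filesystem's own multi-device open path has built the instance, with mutual exclusion coming from the exclusive block device opens rather than from sget() itself.]

/* Hypothetical sketch of the two-pass sget() pattern; not real bcachefs code. */

static int example_test_super(struct super_block *sb, void *data)
{
	/* hypothetical: does this existing instance cover the same devices? */
	return example_fs_matches_devices(sb->s_fs_info, data);
}

static int example_noset_super(struct super_block *sb, void *data)
{
	return -EBUSY;	/* pass 1: refuse to create a new superblock */
}

static int example_set_super(struct super_block *sb, void *data)
{
	sb->s_fs_info = data;	/* pass 2: attach the fs instance we built */
	return 0;
}

static struct dentry *example_mount(struct file_system_type *fs_type,
				    int flags, const char *dev_name, void *data)
{
	struct example_devs *devs = example_parse_devices(dev_name, data);
	struct example_fs *c;
	struct super_block *sb;

	/* Pass 1: only look for an already-mounted instance of these devices;
	 * sget() propagates the -EBUSY from the set callback if none exists. */
	sb = sget(fs_type, example_test_super, example_noset_super,
		  flags | SB_NOSEC, devs);
	if (!IS_ERR(sb))
		goto got_sb;
	if (PTR_ERR(sb) != -EBUSY)
		return ERR_CAST(sb);

	/*
	 * Pass 2: nothing mounted yet.  Open the block devices through the
	 * filesystem's own, VFS-independent open path (exclusive bdev opens
	 * provide the real mutual exclusion), then attach a fresh superblock.
	 */
	c = example_fs_open(devs);
	if (IS_ERR(c))
		return ERR_CAST(c);

	sb = sget(fs_type, NULL, example_set_super, flags | SB_NOSEC, c);
	if (IS_ERR(sb)) {
		example_fs_put(c);
		return ERR_CAST(sb);
	}
got_sb:
	/* ... fill in a new superblock or reuse the existing one ... */
	return dget(sb->s_root);
}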
On Fri, Jul 07, 2023 at 08:18:28AM -0400, Brian Foster wrote: > As it is, I have been able to dig into various areas of the code, learn > some of the basic principles, diagnose/fix issues and get some of those > fixes merged without too much trouble. IMO, the code is fairly well > organized at a high level, reasonably well documented and > debuggable/supportable. That isn't to say some of those things couldn't > be improved (and I expect they will be), but these are more time and > resource constraints than anything and so I don't see any major red > flags in that regard. Some of my bigger personal gripes would be a lot > of macro code generation stuff makes it a bit harder (but not > impossible) for a novice to come up to speed, Yeah, we use x-macros extensively for e.g. enums so we can also generate string arrays. Wonderful for the to_text functions, annoying for breaking ctags/cscope. > and similarly a bit more > introductory/feature level documentation would be useful to help > navigate areas of code without having to rely on Kent as much. The > documentation that is available is still pretty good for gaining a high > level understanding of the fs data structures, though I agree that more > content on things like on-disk format would be really nice. A thought I'd been meaning to share anyways: when there's someone new getting up to speed on the codebase, I like to use it as an opportunity to write documentation. If anyone who's working on the code asks for a section of code to be documented - just tell me what you're looking at and give me an idea of what your questions are and I'll write out a patch adding kdoc comments. For me, this is probably the lowest stress way to get code documentation written, and that way it's added to the code for the next person too. In the past we've also done meetings where we looked at source code together and I walked people through various codepaths - I think those were effective at getting people some grounding, and especially if there's more people interested I'd be happy to do that again. Also, I'm pretty much always on IRC - don't hesitate to use me as a resource! > Functionality wise I think it's inevitable that there will be some > growing pains as user and developer base grows. For that reason I think > having some kind of experimental status for a period of time is probably > the right approach. Most of the issues I've dug into personally have > been corner case type things, but experience shows that these are the > sorts of things that eventually arise with more users. We've also > briefly discussed things like whether bcachefs could take more advantage > of some of the test coverage that btrfs already has in fstests, since > the feature sets should largely overlap. That is TBD, but is something > else that might be a good step towards further proving out reliability. Yep, bcachefs implements essentially the same basic user interface for subvolumes/snapshots, and test coverage for snapshots is an area where we're still somewhat weak. > Related to that, something I'm not sure I've seen described anywhere is > the functional/production status of the filesystem itself (not > necessarily the development status of the various features). For > example, is the filesystem used in production at any level? If so, what > kinds of deployments, workloads and use cases do you know about? How > long have they been in use, etc.? 
I realize we may not have knowledge or > permission to share details, but any general info about usage in the > wild would be interesting.

I don't have any hard numbers, I can only try to infer. But it has been used in production (by paying customers) at at least a few sites for several years; I couldn't say how many because I only find out when something breaks :)

In the wider community, it's at least hundreds, likely thousands based on distinct users reporting bugs, the amount of hammering on my git server since it got packaged in nixos, etc. There's users in the IRC channel who've been running it on multiple machines for probably 4-5 years, and generally continuously upgrading them (I've never done an on disk format change that required a mkfs); I've been running it on my laptops for about that long as well.

Based on the types of bug reports I've been seeing, things have been stabilizing quite nicely - AFAIK no one's losing data; we do have some outstanding filesystem corruption bugs but they're little things that fsck can easily repair and don't lead to data loss (e.g. the erasure coding tests are complaining about disk space utilization counters being wrong, some of our tests are still finding the occasional backpointers bug - Brian just started looking at that one :)

The exception is snapshots, there's a user in China who's been throwing crazy database workloads at bcachefs - that's still seeing some data corruption (he sent me a filesystem metadata dump with 2k snapshots that hit O(n^3) algorithms in fsck, fixes for that are mostly done) - once I get back to that work and doing more proper torture testing that should be ironed out soon, we know where the issue is now.

> The development process is fairly ad hoc, so I suspect that is something > that would have to evolve if this lands upstream. Kent, did you have > thoughts/plans around that? I don't mind contributing reviews where I > can, but that means patches would be posted somewhere for feedback, etc. > I suppose that has potential to slow things down, but also gives people > a chance to see what's happening, review or ask questions, etc., which > is another good way to learn or simply keep up with things.

Yeah, that's a good discussion.

I wouldn't call the current development process "ad hoc", it's the process that's evolved to let me write code the fastest without making users unhappy :) and that mostly revolves around good test infrastructure, and a well structured codebase with excellent assertions so that we can make changes with high confidence that if the tests pass it's good.

Regarding code review: We do need to do more of that going forward, and probably talk about what's most comfortable for people, but I'm also not a big fan of how I see code review typically happening on the kernel mailing lists and I want to try to push things in a different direction for bcachefs.

In my opinion, the way we do code review tends to be on the very fastidious/nitpicky side of things; and this made sense historically when kernel testing was _hard_ and we depended a lot more on human review to catch errors. But the downside of that kind of code review is it's a big time sink, and it burns people out (both the reviewers, and the people who are trying to get reviews!) - and when the discussion is mostly about nitpicky things, that takes away energy that could be going into the high level "what do we want to do and what ideas do we have for how to get there" discussions.
When we're doing code review for bcachefs, I don't want to see people nitpicking style and complaining about the formatting of if statements, and I don't want people poring over every detail trying to catch bugs that our test infrastructure will catch. Instead, save that energy for: - identifying things that are legitimately tricky or have a high probability of introducing errors that won't be caught by tests: that's something we do want to talk about, that's being proactive - looking at the code coverage analysis to see where we're missing tests (we have automated code coverage analysis now!) - making sure changes are sane and _understandable_ - and just keeping abreast of each other's work. We don't need to get every detail, just the gist so we can understand each other's goals. The interactions in engineering teams that I've found to be the most valuable have never been code review; they're the more abstract discussions that happen _after_ we all understand what we're trying to do. That's what I want to see more of. Now, getting back to the "how are we going to do code review" discussion - I personally prefer to do code review over IRC with a link to the author's git repository; I find a conversational format and quick feedback to be very valuable (I do not want people blocked because they're waiting on code review). But the mailing list sees a wider audience, so I see no reason why we can't start sending all our patches to the linux-bcachefs mailing list as well. Regarding the "more abstract, what are we trying to do" discussions: I'm currently hosting a bcachefs cabal meeting every other week, and I might bump it up to once a week soon - email me if you'd like an invite; the wider fs community is definitely meant to be included. I've also experimented in the past with an open voice chat/conference call (hosted via the matrix channel); most of us aren't in office environments anymore, but the shared office with people working on similar things was great for quickly getting people up to speed, and the voice chat seemed to work well for that - I'm inclined to start doing that again. > All in all I pretty much agree with Josef wrt to the merge request. ISTM > the main issues right now are the external dependencies and > development/community situation (i.e. bus factor). As above, I plan to > continue contributions at least in terms of fixes and whatnot so long as > $employer continues to allow me to dedicate at least some time to it and > the community is functional ;), but it's not clear to me if that is > sufficient to address the concerns here. WRT the dependencies, I agree > it makes sense to be deliberate and for anything that is contentious, > either just drop it or lift it into bcachefs for now to avoid the need > to debate on these various fronts in the first place (and simplify the > pull request as much as possible). I'd hoped we could table the discussion on "dependencies" in the abstract. Prior consensus, from multiple occasions when I was feeding in bcachefs prep work, was that patches that were _only_ needed for bcachefs should be part of the bcachefs pull request - that's what I've been sticking to. Slimming down the dependencies any further will require non-ideal engineering tradeoffs, so any request/suggestion to do so needs to come with some specifics. And Jens already ok'd the 4 block patches, which were the most significant. 
> With those issues addressed, perhaps it would be helpful if other > interested fs maintainers/devs could chime in with any thoughts on what > they'd want to see in order to ack (but not necessarily "review") a new > filesystem pull request..? I don't have the context of the off list > thread, but from this thread ISTM that perhaps Josef and Darrick are > close to being "soft" acks provided the external dependencies are worked > out. Christoph sent a nak based on maintainer status. Kent, you can add > me as a reviewer if 1. you think that will help and 2. if you plan to > commit to some sort of more formalized development process that will > facilitate review..? That sounds agreeable :)
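A minimal, self-contained userspace sketch of the x-macro pattern Kent mentions above, with invented names rather than anything from the bcachefs sources: one member list expands into both an enum and a matching string table for to_text-style helpers.

#include <stdio.h>

/* One list of members; each use redefines x() to emit whatever is
 * needed, so the enum and the string array can never get out of sync. */
#define EXAMPLE_STATES()	\
	x(unwritten)		\
	x(dirty)		\
	x(clean)

enum example_state {
#define x(n)	STATE_##n,
	EXAMPLE_STATES()
#undef x
	STATE_NR
};

static const char * const example_state_strs[] = {
#define x(n)	#n,
	EXAMPLE_STATES()
#undef x
	NULL
};

int main(void)
{
	for (int i = 0; i < STATE_NR; i++)
		printf("%d: %s\n", i, example_state_strs[i]);
	return 0;
}

Adding a member to EXAMPLE_STATES() updates both definitions at once - which is also exactly why ctags/cscope struggle with it: the enum identifiers never appear literally anywhere in the source.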
On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote: > > > just merge it and let's move on to the next thing." > > > > "and let the block and vfs maintainers and developers deal with the > > fallout" > > > > is how that reads to others that deal with 65+ filesystems and > > counting. > > > > The offlist thread that was started by Kent before this pr was sent > > has seen people try to outline calmly what problems they currently > > still have both maintenance wise and upstreaming wise. And it seems > > there's just no way this can go over calmly but instead requires > > massive amounts of defensive pushback and grandstanding. > > > > Our main task here is to consider the concerns of people that > > constantly review and rework massive amounts of generic code. And I > > can't in good conscience see their concerns dismissed with snappy > > quotes. > > > > I understand the impatience, I understand the excitement, I really > > do. But not in this way where core people just drop off because > > they don't want to deal with this anymore. > > > > I've spent enough time on this thread. > > Christain, the hostility I'm reading is in your steady passive > aggressive accusations, and your patronizing attitude. It's not > professional, and it's not called for. Can you not see that saying this is a huge red flag? With you every disagreement becomes, as Josef said, "a hill to die on" and you then feel entitled to indulge in ad hominem attacks, like this, or be dismissive or try to bury whoever raised the objection in technical minutiae in the hope you can demonstrate you have a better grasp of the details than they do and therefore their observation shouldn't count. One of a maintainer's jobs is to nurture and build a community and that's especially important at the inclusion of a new feature. What we've seen from you implies you'd build a community of little Kents (basically an echo chamber of people who agree with you) and use them as a platform to attack any area of the kernel you didn't agree with technically (which, apparently, would be most of block and vfs with a bit of mm thrown in), leading to huge divisions and infighting. Anyone who had the slightest disagreement with you would be out and would likely behave in the same way you do now leading to internal community schisms and more fighting on the lists. We've spent years trying to improve the lists and make the community welcoming. However technically brilliant a new feature is, it can't come with this sort of potential for community and reputational damage. > Can we please try to stay focused on the code, and the process, and > the _actual_ concerns? > > In that offlist thread, I don't recall much in the way of actual, > concrete concerns. I do recall Christoph doing his usual schpiel; and > to be clear, I cut short my interactions with Christoph because in > nearly 15 years of kernel development he's never been anything but > hostile to anything I've posted, and the criticisms he posts tend to > be vague and unaware of the surrounding discussion, not anything > actionable. This too is a red flag. Working with difficult people is one of a maintainer's jobs as well. Christoph has done an enormous amount of highly productive work over the years. Sure, he's prickly and sure there have been fights, but everyone except you seems to manage to patch things up and accept his contributions. 
If it were just one personal problem it might be overlookable, but you seem to be having major fights with the maintainer of every subsystem you touch... James
On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > > On Fri, Jul 07, 2023 at 10:48:55AM +0200, Christian Brauner wrote: > > > > just merge it and let's move on to the next thing." > > > > > > "and let the block and vfs maintainers and developers deal with the > > > fallout" > > > > > > is how that reads to others that deal with 65+ filesystems and > > > counting. > > > > > > The offlist thread that was started by Kent before this pr was sent > > > has seen people try to outline calmly what problems they currently > > > still have both maintenance wise and upstreaming wise. And it seems > > > there's just no way this can go over calmly but instead requires > > > massive amounts of defensive pushback and grandstanding. > > > > > > Our main task here is to consider the concerns of people that > > > constantly review and rework massive amounts of generic code. And I > > > can't in good conscience see their concerns dismissed with snappy > > > quotes. > > > > > > I understand the impatience, I understand the excitement, I really > > > do. But not in this way where core people just drop off because > > > they don't want to deal with this anymore. > > > > > > I've spent enough time on this thread. > > > > Christain, the hostility I'm reading is in your steady passive > > aggressive accusations, and your patronizing attitude. It's not > > professional, and it's not called for. > > Can you not see that saying this is a huge red flag? With you every > disagreement becomes, as Josef said, "a hill to die on" and you then > feel entitled to indulge in ad hominem attacks, like this, or be > dismissive or try to bury whoever raised the objection in technical > minutiae in the hope you can demonstrate you have a better grasp of the > details than they do and therefore their observation shouldn't count. > > One of a maintainer's jobs is to nurture and build a community and > that's especially important at the inclusion of a new feature. What > we've seen from you implies you'd build a community of little Kents > (basically an echo chamber of people who agree with you) and use them > as a platform to attack any area of the kernel you didn't agree with > technically (which, apparently, would be most of block and vfs with a > bit of mm thrown in), leading to huge divisions and infighting. Anyone > who had the slightest disagreement with you would be out and would > likely behave in the same way you do now leading to internal community > schisms and more fighting on the lists. > > We've spent years trying to improve the lists and make the community > welcoming. However technically brilliant a new feature is, it can't > come with this sort of potential for community and reputational damage. > > > Can we please try to stay focused on the code, and the process, and > > the _actual_ concerns? > > > > In that offlist thread, I don't recall much in the way of actual, > > concrete concerns. I do recall Christoph doing his usual schpiel; and > > to be clear, I cut short my interactions with Christoph because in > > nearly 15 years of kernel development he's never been anything but > > hostile to anything I've posted, and the criticisms he posts tend to > > be vague and unaware of the surrounding discussion, not anything > > actionable. > > This too is a red flag. Working with difficult people is one of a > maintainer's jobs as well. Christoph has done an enormous amount of > highly productive work over the years. 
> Sure, he's prickly and sure > there have been fights, but everyone except you seems to > manage to patch things up and accept his contributions. If it were just one > personal problem it might be overlookable, but you seem to be having > major fights with the maintainer of every subsystem you touch... James, I will bend over backwards to work with people who will work to continue the technical discussion. That's what I'm here to do. I'm going to bow out of this line of discussion on the thread. Feel free to continue privately if you like.
On Fri, 2023-07-07 at 12:48 -0400, Kent Overstreet wrote: > On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: [...] > > > In that offlist thread, I don't recall much in the way of actual, > > > concrete concerns. I do recall Christoph doing his usual schpiel; > > > and to be clear, I cut short my interactions with Christoph > > > because in nearly 15 years of kernel development he's never been > > > anything but hostile to anything I've posted, and the criticisms > > > he posts tend to be vague and unaware of the surrounding > > > discussion, not anything actionable. > > > > This too is a red flag. Working with difficult people is one of a > > maintainer's jobs as well. Christoph has done an enormous amount > > of highly productive work over the years. Sure, he's prickly and > > sure there have been fights, but everyone except you seems to > > manage to patch things up and accept his contributions. If it were > > just one personal problem it might be overlookable, but you seem to > > be having major fights with the maintainer of every subsystem you > > touch... > > James, I will bend over backwards to work with people who will work > to continue the technical discussion. You will? Because that doesn't seem to align with your statement about Christoph being "vague and unaware of the surrounding discussions" and not posting "anything actionable" for the last 15 years. No-one else has that impression and we've almost all had run-ins with Christoph at some point. James
On Fri, Jul 07, 2023 at 01:04:14PM -0400, James Bottomley wrote: > On Fri, 2023-07-07 at 12:48 -0400, Kent Overstreet wrote: > > On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > > > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > [...] > > > > In that offlist thread, I don't recall much in the way of actual, > > > > concrete concerns. I do recall Christoph doing his usual schpiel; > > > > and to be clear, I cut short my interactions with Christoph > > > > because in nearly 15 years of kernel development he's never been > > > > anything but hostile to anything I've posted, and the criticisms > > > > he posts tend to be vague and unaware of the surrounding > > > > discussion, not anything actionable. > > > > > > This too is a red flag. Working with difficult people is one of a > > > maintainer's jobs as well. Christoph has done an enormous amount > > > of highly productive work over the years. Sure, he's prickly and > > > sure there have been fights, but everyone except you seems to > > > manage to patch things up and accept his contributions. If it were > > > just one personal problem it might be overlookable, but you seem to > > > be having major fights with the maintainer of every subsystem you > > > touch... > > > > James, I will bend over backwards to work with people who will work > > to continue the technical discussion. > > You will? Because that doesn't seem to align with your statement about > Christoph being "vague and unaware of the surrounding discussions" and > not posting "anything actionable" for the last 15 years. No-one else > has that impression and we've almost all had run-ins with Christoph at > some point. If I'm going to respond to this I'd have to start citing interactions and I don't want to dig things that deep in public. Can we either try to resolve this privately or drop it?
On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > > Christain, the hostility I'm reading is in your steady passive > > aggressive accusations, and your patronizing attitude. It's not > > professional, and it's not called for. > > Can you not see that saying this is a huge red flag? With you every > disagreement becomes, as Josef said, "a hill to die on" and you then > feel entitled to indulge in ad hominem attacks, like this, or be > dismissive or try to bury whoever raised the objection in technical > minutiae in the hope you can demonstrate you have a better grasp of the > details than they do and therefore their observation shouldn't count. > > One of a maintainer's jobs is to nurture and build a community and > that's especially important at the inclusion of a new feature. What > we've seen from you implies you'd build a community of little Kents > (basically an echo chamber of people who agree with you) and use them > as a platform to attack any area of the kernel you didn't agree with > technically (which, apparently, would be most of block and vfs with a > bit of mm thrown in), leading to huge divisions and infighting. Anyone > who had the slightest disagreement with you would be out and would > likely behave in the same way you do now leading to internal community > schisms and more fighting on the lists. > > We've spent years trying to improve the lists and make the community > welcoming. However technically brilliant a new feature is, it can't > come with this sort of potential for community and reputational damage. I don't think the lists are any better, tbh. Yes, the LF has done a great job of telling people not to use "bad words" any more. But people are still arseholes to each other. They're just more subtle about it now. I'm not going to enumerate the ways because that's pointless. Consider this thread from Kent's point of view. He's worked for years on bcachefs. Now he's asking "What needs to happen to get this merged?" And instead of getting a clear answer as to the technical pieces that need to get fixed, various people are taking the opportunity to tell him he's a Bad Person. And when he reacts to that, this is taken as more evidence that he's a Bad Person, rather than being a person who is in a stressful situation (Limbo? Purgatory?) who is perhaps not reacting in the most constructive way. I don't think Kent is particularly worse as a fellow developer than you or I or Jens, Greg, Al, Darrick, Dave, Dave, Dave, Dave, Josef or Brian. There are some social things which are a concern to me. There's no obvious #2 or #3 to step in if Kent does get hit by the proverbial bus, but that's been discussed elsewhere in the thread. Anyway, I'm in favour of bcachefs inclusion. I think the remaining problems can be worked out post-merge. I don't see Kent doing a drop-and-run on the codebase. Maintaining this much code outside the main kernel tree is hard. One thing I particularly like about btrfs compared to ntfs3 is that it doesn't use old legacy code like the buffer heads, which means that it doesn't add to the technical debt. From the page cache point of view, it's fairly clean. I wish it used iomap, but iomap would need quite a lot of new features to accommodate everything bcachefs wants to do. Maybe iomap will grow those features over time.
On Sat, Jul 08, 2023 at 04:54:22AM +0100, Matthew Wilcox wrote: > One thing I particularly like about btrfs :) > compared to ntfs3 is that it doesn't use old legacy code like the buffer > heads, which means that it doesn't add to the technical debt. From the > page cache point of view, it's fairly clean. I wish it used iomap, but > iomap would need quite a lot of new features to accommodate everything > bcachefs wants to do. Maybe iomap will grow those features over time. My big complaint with iomap is that it's still the old callback based approach - an indirect function call into the filesystem to get a mapping, then Doing Stuff, for every walk. Instead of calling back and forth, we could be filling out a data structure to represent the IO, then handing it off to the filesystem to look up the mappings and send to the right place, splitting as needed. Best part is, we already have such a data structure: struct bio. That's the approach bcachefs takes. It would be nice sharing the page cache management code, but like you mentioned, iomap would have to grow a bunch of features. But, some of those features other users might like: in particular bcachefs hangs disk reservations and dirty sector (for i_blocks accounting) off the pagecache, which to me is a total no brainer, it eliminates looking up in a second data structure for e.g. the buffered write path. Also worth noting - bcachefs has had large folio support for awhile now :)
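A rough userspace sketch of the contrast Kent is drawing: an indirect call back into the filesystem for every extent, versus filling out one descriptor for the whole IO and handing it to the filesystem to split. All types and names here are invented for illustration; this is not iomap or bcachefs code.

#include <stdint.h>
#include <stdio.h>

struct mapping { uint64_t file_off, dev_off, len; };

/* Shape 1: the generic layer calls back into the fs for every extent */
typedef void (*get_mapping_fn)(uint64_t file_off, uint64_t len, struct mapping *m);

static void io_by_callbacks(get_mapping_fn get_mapping, uint64_t off, uint64_t len)
{
	while (len) {
		struct mapping m;
		get_mapping(off, len, &m);	/* one indirect call per extent */
		printf("submit file %llu+%llu -> dev %llu\n",
		       (unsigned long long)m.file_off,
		       (unsigned long long)m.len,
		       (unsigned long long)m.dev_off);
		off += m.len;
		len -= m.len;
	}
}

/* Shape 2: describe the whole IO once; the fs looks up mappings and splits */
struct io_desc { uint64_t file_off, len; /* pages/bvecs would hang off this */ };

static void fs_submit_whole_io(const struct io_desc *io)
{
	printf("fs splits and submits file %llu+%llu internally\n",
	       (unsigned long long)io->file_off,
	       (unsigned long long)io->len);
}

static void demo_get_mapping(uint64_t file_off, uint64_t len, struct mapping *m)
{
	/* pretend extents are at most 4096 bytes long */
	m->file_off = file_off;
	m->dev_off = file_off + (1 << 20);
	m->len = len < 4096 ? len : 4096;
}

int main(void)
{
	io_by_callbacks(demo_get_mapping, 0, 8192);

	struct io_desc io = { .file_off = 0, .len = 8192 };
	fs_submit_whole_io(&io);
	return 0;
}

The second shape is what a bio already provides: a single description of the IO that the filesystem can split against its own mapping information, rather than the generic layer calling back once per extent.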
On Sat, Jul 08, 2023 at 04:54:22AM +0100, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > > > Christain, the hostility I'm reading is in your steady passive > > > aggressive accusations, and your patronizing attitude. It's not > > > professional, and it's not called for. > > > > Can you not see that saying this is a huge red flag? With you every > > disagreement becomes, as Josef said, "a hill to die on" and you then > > feel entitled to indulge in ad hominem attacks, like this, or be > > dismissive or try to bury whoever raised the objection in technical > > minutiae in the hope you can demonstrate you have a better grasp of the > > details than they do and therefore their observation shouldn't count. > > > > One of a maintainer's jobs is to nurture and build a community and > > that's especially important at the inclusion of a new feature. What > > we've seen from you implies you'd build a community of little Kents > > (basically an echo chamber of people who agree with you) and use them > > as a platform to attack any area of the kernel you didn't agree with > > technically (which, apparently, would be most of block and vfs with a > > bit of mm thrown in), leading to huge divisions and infighting. Anyone > > who had the slightest disagreement with you would be out and would > > likely behave in the same way you do now leading to internal community > > schisms and more fighting on the lists. > > > > We've spent years trying to improve the lists and make the community > > welcoming. However technically brilliant a new feature is, it can't > > come with this sort of potential for community and reputational damage. > > I don't think the lists are any better, tbh. Yes, the LF has done a great > job of telling people not to use "bad words" any more. But people are > still arseholes to each other. They're just more subtle about it now. > I'm not going to enumerate the ways because that's pointless. I've long thought a more useful CoC would start with "always try to continue the technical conversation in good faith, always try to build off of what other people are saying; don't shut people down". The work we do has real consequences. There are consequences for the people doing the work, and consequences for the people that use our work if we screw things up. Things are bound to get heated at times; that's expected, and it's ok - as long as we can remember to keep doing the work and pushing forward.
On Sat, Jul 08, 2023 at 12:31:36AM -0400, Kent Overstreet wrote: > > I've long thought a more useful CoC would start with "always try to > continue the technical conversation in good faith, always try to build > off of what other people are saying; don't shut people down". Kent, with all due respect, you do not always follow the formulation you've suggested above. That is to say, you do not always assume that your conversational partner is trying to raise objections in good faith. You also want to assume that you are the smartest person in the room, and if they object, they are Obviously Wrong. As a result, it's not pleasant to have a technical conversation with you, and as others have said, when someone like Christian Brauner has decided that it's too frustrating to continue with the thread, given my observations of his past interaction with a wide variety of people, including some folks who have been traditionally regarded as "difficult to work with", it's a real red flag. Regards, - Ted
On Sat, Jul 08, 2023 at 11:02:49AM -0400, Theodore Ts'o wrote: > On Sat, Jul 08, 2023 at 12:31:36AM -0400, Kent Overstreet wrote: > > > > I've long thought a more useful CoC would start with "always try to > > continue the technical conversation in good faith, always try to build > > off of what other people are saying; don't shut people down". > > Kent, with all due respect, do you not always follow your suggested > formulation that you've stated above. That is to say, you do not > always assume that your conversational partner is trying to raise > objections in good faith. Ted, how do you have a technical conversation with someone who refuses to say anything concrete, even when you ask them to elaborate on their objections, and instead just repeats the same vague non-answers? > You also want to assume that you are the smartest person in the room, > and if they object, they are Obviously Wrong. Ok, now you're really reaching. Anyone who's actually worked with me can tell you I am quick to consider other people's point of view and quick to admit when I'm wrong. All I ask is the same courtesy.
On Sat, 2023-07-08 at 04:54 +0100, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 12:26:19PM -0400, James Bottomley wrote: > > On Fri, 2023-07-07 at 05:18 -0400, Kent Overstreet wrote: > > > Christain, the hostility I'm reading is in your steady passive > > > aggressive accusations, and your patronizing attitude. It's not > > > professional, and it's not called for. > > > > Can you not see that saying this is a huge red flag? With you > > every disagreement becomes, as Josef said, "a hill to die on" and > > you then feel entitled to indulge in ad hominem attacks, like this, > > or be dismissive or try to bury whoever raised the objection in > > technical minutiae in the hope you can demonstrate you have a > > better grasp of the details than they do and therefore their > > observation shouldn't count. > > > > One of a maintainer's jobs is to nurture and build a community and > > that's especially important at the inclusion of a new feature. > > What we've seen from you implies you'd build a community of little > > Kents (basically an echo chamber of people who agree with you) and > > use them as a platform to attack any area of the kernel you didn't > > agree with technically (which, apparently, would be most of block > > and vfs with a bit of mm thrown in), leading to huge divisions and > > infighting. Anyone who had the slightest disagreement with you > > would be out and would likely behave in the same way you do now > > leading to internal community schisms and more fighting on the > > lists. > > > > We've spent years trying to improve the lists and make the > > community welcoming. However technically brilliant a new feature > > is, it can't come with this sort of potential for community and > > reputational damage. > > I don't think the lists are any better, tbh. Yes, the LF has done a > great job of telling people not to use "bad words" any more. But > people are still arseholes to each other. I don't think the LF has done much actively on the lists ... we've been trying to self police. > They're just more subtle about it now. I'm not going to enumerate > the ways because that's pointless. Well, we can agree to differ since this isn't relevant to the main argument. > Consider this thread from Kent's point of view. He's worked for > years on bcachefs. Now he's asking "What needs to happen to get this > merged?" And instead of getting a clear answer as to the technical > pieces that need to get fixed, various people are taking the > opportunity to tell him he's a Bad Person. And when he reacts to > that, this is taken as more evidence that he's a Bad Person, rather > than being a person who is in a stressful situation (Limbo? > Purgatory?) who is perhaps not reacting in the most constructive way. That's a bit of a straw man: I never said or implied "bad person". I gave two examples, one from direct list interaction and one quoted from Kent of what I consider to be red flags behaviours on behalf of a maintainer. > I don't think Kent is particularly worse as a fellow developer than > you or I or Jens, Greg, Al, Darrick, Dave, Dave, Dave, Dave, Josef or > Brian. I don't believe any of us have been unable to work with a fairly prolific contributor for 15 years ... > There are some social things which are a concern to me. There's no > obvious #2 or #3 to step in if Kent does get hit by the proverbial > bus, but that's been discussed elsewhere in the thread. 
Actually, I don't think this is a problem: a new feature has no users and having no users, it doesn't matter if it loses its only maintainer because it can be excised without anyone really noticing. The bus problem (or more accurately xkcd 2347 problem) commonly applies to a project with a lot of users but an anaemic developer community, which is a thing that can be grown to but doesn't happen ab initio. The ordinary course for a kernel feature is single developer; hobby project (small community of users as developers); and eventually a non technical user community. Usually the hobby project phase grows enough interested developers to ensure a fairly healthy developer community by the time it actually acquires non developer users (and quite a few of our features never actually get out of the hobby project phase). James
On Sat, Jul 08, 2023 at 12:42:59PM -0400, James Bottomley wrote: > That's a bit of a straw man: I never said or implied "bad person". I > gave two examples, one from direct list interaction and one quoted from > Kent of what I consider to be red flags behaviours on behalf of a > maintainer. You responded with a massive straw man about an army of little Kents - seriously, what the hell was that about? The only maintainers that I've had ongoing problems with have been Jens and Christoph, and there's more history to that than I want to get into. If you're talking about _our_ disagreement, I was arguing that cut and pasting code from other repositories is a terrible workflow that's going to cause us problems down the road, especially for the Rust folks, and then afterwards you started hounding me in unrelated LKML discussions. So clearly you took that personally, and I think maybe you still are.
So: looks like we missed the merge window. Boo :) Summing up discussions from today's cabal meeting, other off list discussions, and this thread: - bcachefs is now marked EXPERIMENTAL - Brian Foster will be listed as a reviewer - Josef's stepping up to do some code review, focusing on vfs-interacty bits. I'm hoping to do at least some of this in a format where Josef peppers me with questions and we turn that into new code documentation, so others can directly benefit: if anyone has an area they work on and would like to see documented in bcachefs, we'll take a look at that too. - Prereq patch series has been pruned down a bit more; also Mike Snitzer suggested putting those patches in their own branch: https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced with willy's "iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4 since it's technically a bug fix; in the meantime, it'll be getting more testing from my users. The two lockdep patches have been dropped for now; the bcachefs-for-upstream branch is switched back to lockdep_set_novalidate_class() for btree node locks. six locks, mean and variance have been moved into fs/bcachefs/ for now; this means there's a new prereq patch to export osq_(lock|unlock) The remaining prereq patches are pretty trivial, with the exception of "block: Don't block on s_umount from __invalidate_super()". I would like to get a reviewed-by for that patch, and it wouldn't hurt for others. previously posting: https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f - Code review was talked about a bit earlier in the thread: for the moment I'm just posting big stuff, but I'd like to aim for making sure all patches (including mine) hit the linux-bcachefs mailing list in the future: https://lore.kernel.org/linux-bcachefs/20230709171551.2349961-1-kent.overstreet@linux.dev/T/ - We also talked quite a bit about the QA process. I'm going to work on finally publishing ktest/ktestci, which is my test infrastructure that myself and a few other people are using - I'd like to see it used more widely. For now, here's the test dashboard for the bcachefs-for-upstream branch: https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream - Also: not directly related to upstreaming, but relevant for the community: we talked about getting together a meeting with some of the btrfs people to gather design input, ideas, and lessons learned. If anyone would be interested in working on and improving the multi device capabilities of bcachefs in particular, this would be a great time to get involved. That stuff is in good shape and seeing a lot of active use - it's one of bcachefs's major drawing points - and I want it to be even better. And here's the branch I intend to re-submit next merge window, as it currently sits: https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-upstream Please chime in if I forgot anything important... :) Cheers, Kent
On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote: > - Prereq patch series has been pruned down a bit more; also Mike > Snitzer suggested putting those patches in their own branch: > > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs > > "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced > with willy's "iov_iter: Handle compound highmem pages in > copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4 > since it's technically a bug fix; in the meantime, it'll be getting > more testing from my users. > > The two lockdep patches have been dropped for now; the > bcachefs-for-upstream branch is switched back to > lockdep_set_novalidate_class() for btree node locks. > > six locks, mean and variance have been moved into fs/bcachefs/ for > now; this means there's a new prereq patch to export > osq_(lock|unlock) > > The remaining prereq patches are pretty trivial, with the exception > of "block: Don't block on s_umount from __invalidate_super()". I > would like to get a reviewed-by for that patch, and it wouldn't hurt > for others. > > previously posting: > https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f Can you send these prereqs out again, with maintainers CCed appropriately? (I think some feedback from the prior revision needs to be addressed first, though. For example, __flatten already exists, etc.)
On Wed, Jul 12, 2023 at 12:48:31PM -0700, Kees Cook wrote: > On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote: > > - Prereq patch series has been pruned down a bit more; also Mike > > Snitzer suggested putting those patches in their own branch: > > > > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs > > > > "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced > > with willy's "iov_iter: Handle compound highmem pages in > > copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4 > > since it's technically a bug fix; in the meantime, it'll be getting > > more testing from my users. > > > > The two lockdep patches have been dropped for now; the > > bcachefs-for-upstream branch is switched back to > > lockdep_set_novalidate_class() for btree node locks. > > > > six locks, mean and variance have been moved into fs/bcachefs/ for > > now; this means there's a new prereq patch to export > > osq_(lock|unlock) > > > > The remaining prereq patches are pretty trivial, with the exception > > of "block: Don't block on s_umount from __invalidate_super()". I > > would like to get a reviewed-by for that patch, and it wouldn't hurt > > for others. > > > > previously posting: > > https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f > > Can you send these prereqs out again, with maintainers CCed > appropriately? (I think some feedback from the prior revision needs to > be addressed first, though. For example, __flatten already exists, etc.) Thanks for pointing that out, I knew it was in the pipeline :) Will do...
On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote: > So: looks like we missed the merge window. Boo :) > > Summing up discussions from today's cabal meeting, other off list > discussions, and this thread: > > - bcachefs is now marked EXPERIMENTAL > > - Brian Foster will be listed as a reviewer /me applauds! > - Josef's stepping up to do some code review, focusing on vfs-interacty > bits. I'm hoping to do at least some of this in a format where Josef > peppers me with questions and we turn that into new code > documentation, so others can directly benefit: if anyone has an area > they work on and would like to see documented in bcachefs, we'll take > a look at that too. > > - Prereq patch series has been pruned down a bit more; also Mike > Snitzer suggested putting those patches in their own branch: > > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-prereqs > > "iov_iter: copy_folio_from_iter_atomic()" was dropped and replaced > with willy's "iov_iter: Handle compound highmem pages in > copy_page_from_iter_atomic()"; he said he'd try to send this for -rc4 > since it's technically a bug fix; in the meantime, it'll be getting > more testing from my users. > > The two lockdep patches have been dropped for now; the > bcachefs-for-upstream branch is switched back to > lockdep_set_novalidate_class() for btree node locks. > > six locks, mean and variance have been moved into fs/bcachefs/ for > now; this means there's a new prereq patch to export > osq_(lock|unlock) > > The remaining prereq patches are pretty trivial, with the exception > of "block: Don't block on s_umount from __invalidate_super()". I > would like to get a reviewed-by for that patch, and it wouldn't hurt > for others. > > previously posting: > https://lore.kernel.org/linux-bcachefs/20230509165657.1735798-1-kent.overstreet@linux.dev/T/#m34397a4d39f5988cc0b635e29f70a6170927746f > > - Code review was talked about a bit earlier in the thread: for the > moment I'm just posting big stuff, but I'd like to aim for making > sure all patches (including mine) hit the linux-bcachefs mailing list > in the future: > > https://lore.kernel.org/linux-bcachefs/20230709171551.2349961-1-kent.overstreet@linux.dev/T/ > > - We also talked quite a bit about the QA process. I'm going to work on > finally publishing ktest/ktestci, which is my test infrastructure > that myself and a few other people are using - I'd like to see it > used more widely. > > For now, here's the test dashboard for the bcachefs-for-upstream > branch: > https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream > > - Also: not directly related to upstreaming, but relevant for the > community: we talked about getting together a meeting with some of > the btrfs people to gather design input, ideas, and lessons learned. Please invite me too! :) Granted XFS doesn't do multi-device support (for large values of 'multi') but now that I've spent 6 years of my life concentrating on repairability for XFS, I might have a few things to say about bcachefs. That is if I can shake off the torrent of syzbot crap long enough to read anything in bcachefs.git. :( --D > If anyone would be interested in working on and improving the multi > device capabilities of bcachefs in particular, this would be a great > time to get involved. That stuff is in good shape and seeing a lot of > active use - it's one of bcachefs's major drawing points - and I want > it to be even better. 
> > And here's the branch I intend to re-submit next merge window, as it > currently sits: > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-upstream > > Please chime in if I forgot anything important... :) > > Cheers, > Kent
On Wed, Jul 12, 2023 at 03:10:12PM -0700, Darrick J. Wong wrote: > On Tue, Jul 11, 2023 at 10:54:59PM -0400, Kent Overstreet wrote: > > - Also: not directly related to upstreaming, but relevant for the > > community: we talked about getting together a meeting with some of > > the btrfs people to gather design input, ideas, and lessons learned. > > Please invite me too! :) > > Granted XFS doesn't do multi-device support (for large values of > 'multi') but now that I've spent 6 years of my life concentrating on > repairability for XFS, I might have a few things to say about bcachefs. Absolutely! Maybe we could start brainstorming ideas to cover now, on the list? I honestly know XFS so little (I've read code here and there, but I don't know much about the high level structure) that I wouldn't know where to start. Filesystems are such a huge world of "oh, that would've made my life so much easier if I'd had that idea at the right time..." > That is if I can shake off the torrent of syzbot crap long enough to > read anything in bcachefs.git. :( :(
[ *Finally* getting back to this, I wanted to start reviewing the changes immediately after the merge window, but something else always kept coming up .. ] On Tue, 11 Jul 2023 at 19:55, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > So: looks like we missed the merge window. Boo :) Well, looking at the latest 'bcachefs-for-upstream' state I see, I'm happy to see the pre-requisites outside bcachefs being much smaller. The six locks are now contained within bcachefs, and I like what I see more now that it doesn't play games with 'u64' and lots of bitfields. I'm still not actually convinced the locks *work* correctly, but I'm not seeing huge red flags. I do suspect there are memory ordering issues in there that would all be hidden on x86, and some of it looks strange, but not necessarily unfixable. Example of oddity: barrier(); w->lock_acquired = true; which really smells like it should be smp_store_release(&w->lock_acquired, true); (and the reader side in six_lock_slowpath() should be a smp_load_acquire()) because otherwise the preceding __list_del() writes would seem to possibly by re-ordered by the CPU to past the lock_acquired write, causing all kinds of problems. On x86, you'd never see that as an issue, since all writes are releases, so the 'barrier()' compiler ordering ends up forcing the right magic. Some of the other oddity is around the this_cpu ops, but I suspect that is at least partly then because we don't have acquire/release versions of the local cpu ops that the code looks like it would want. I did *not* look at any of the internal bcachefs code itself (apart from the locking, obviously). I'm not that much of a low-level filesystem person (outside of the vfs itself), so I just don't care deeply. I care that it's maintained and that people who *are* filesystem people are at least not hugely against it. That said, I do think that the prerequisites should go in first and independently, and through maintainers. And there clearly is something very strange going on with superblock handling and the whole crazy discussion about fput being delayed. It is what it is, and the patches I saw in this thread to not delay them were bad. As to the actual prereqs: I'm not sure why 'd_mark_tmpfile()' didn't do the d_instantiate() that everybody seems to want, but it looks fine to me. Maybe just because Kent wanted the "mark" semantics for the naming. Fine. The bio stuff should preferably go through Jens, or at least at a minimum be acked. The '.faults_disabled_mapping' thing is a bit odd, but I don't hate it, and I could imagine that other filesystems could possibly use that approach instead of the current 'pagefault_disable/enable' games and ->nofault games to avoid the whole "use mmap to have the source and the destination of a write be the same page" thing. So as things stand now, the stuff outside bcachefs itself I don't find objectionable. The stuff _inside_ bcachefs I care about only in the sense that I really *really* would like a locking person to look at the six locks, but at the same time as long as it's purely internal to bcachefs and doesn't possibly affect anything else, I'm not *too* worried about what I see. The thing that actually bothers me most about this all is the personal arguments I saw. That I don't know what to do about. I don't actually want to merge this over the objections of Christian, now that we have a responsible vfs maintainer. 
So those kinds of arguments do kind of have to be resolved, even aside from the "I think the prerequisites should go in separately or at least be clearly acked" issues. Sorry for the delay, I really did want to get these comments out directly after the merge window closed, but this just ended up always being the "next thing".. Linus
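The ordering problem Linus describes can be sketched in userspace with C11 atomics standing in for smp_store_release()/smp_load_acquire(). The names below echo the thread (a waiter with a lock_acquired flag), but this only illustrates the handoff pattern; it is not the six-lock code.

#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>
#include <stdio.h>

struct waiter {
	int payload;			/* stands in for the list-removal writes */
	_Atomic bool lock_acquired;
};

static struct waiter w;

static void *unlocker(void *arg)
{
	(void)arg;
	w.payload = 42;			/* "remove waiter from the list" side effects... */
	/* ...must be visible before the waiter can observe lock_acquired == true */
	atomic_store_explicit(&w.lock_acquired, true, memory_order_release);
	return NULL;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, unlocker, NULL);

	/* waiter spins on its own entry, never touching the lock word itself */
	while (!atomic_load_explicit(&w.lock_acquired, memory_order_acquire))
		;
	printf("payload = %d\n", w.payload);	/* guaranteed to print 42 */

	pthread_join(t, NULL);
	return 0;
}

With the release store paired with an acquire load, the earlier writes are ordered before the flag becomes visible on any architecture; barrier() followed by a plain store only constrains the compiler, which is why the bug stays hidden on x86 and shows up as random stack corruption elsewhere.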
Adding Jens to the CC: On Tue, Aug 08, 2023 at 06:27:29PM -0700, Linus Torvalds wrote: > [ *Finally* getting back to this, I wanted to start reviewing the > changes immediately after the merge window, but something else always > kept coming up .. ] > > On Tue, 11 Jul 2023 at 19:55, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > So: looks like we missed the merge window. Boo :) > > Well, looking at the latest 'bcachefs-for-upstream' state I see, I'm > happy to see the pre-requisites outside bcachefs being much smaller. > > The six locks are now contained within bcachefs, and I like what I see > more now that it doesn't play games with 'u64' and lots of bitfields. Heh, I liked the bitfields - I prefer that to open coding structs, which is a major pet peeve of mine. But the text size went down a lot a lot without them (would like to know why the compiler couldn't constant fold all that stuff out, but... not enough to bother). Anyways... > I'm still not actually convinced the locks *work* correctly, but I'm > not seeing huge red flags. I do suspect there are memory ordering > issues in there that would all be hidden on x86, and some of it looks > strange, but not necessarily unfixable. > > Example of oddity: > > barrier(); > w->lock_acquired = true; > > which really smells like it should be > > smp_store_release(&w->lock_acquired, true); > > (and the reader side in six_lock_slowpath() should be a > smp_load_acquire()) because otherwise the preceding __list_del() > writes would seem to possibly by re-ordered by the CPU to past the > lock_acquired write, causing all kinds of problems. > > On x86, you'd never see that as an issue, since all writes are > releases, so the 'barrier()' compiler ordering ends up forcing the > right magic. Yep, agreed. Also, there's a mildly interesting optimization here: the thread doing the unlock is taking the lock on behalf of the thread waiting for the lock, and signalling via the waitlist entry: this means the thread waiting for the lock doesn't have to touch the cacheline the lock is on at all. IOW, a better version of the handoff that rwsem/mutex do. Been meaning to experiment with dropping osq_lock and instead just adding to the waitlist and spinning on w->lock_acquired; this should actually simplify the code and be another small optimization (less bouncing of the lock cacheline). > Some of the other oddity is around the this_cpu ops, but I suspect > that is at least partly then because we don't have acquire/release > versions of the local cpu ops that the code looks like it would want. You mean using full barriers where acquire/release would be sufficient? > I did *not* look at any of the internal bcachefs code itself (apart > from the locking, obviously). I'm not that much of a low-level > filesystem person (outside of the vfs itself), so I just don't care > deeply. I care that it's maintained and that people who *are* > filesystem people are at least not hugely against it. > > That said, I do think that the prerequisites should go in first and > independently, and through maintainers. Matthew was planning on sending the iov_iter patch to you - right around now, I believe, as a bugfix, since right now copy_page_from_iter_atomic() silently does crazy things if you pass it a compound page. Block layer patches aside, are there any _others_ you really want to go via maintainers? Because the consensus in the past when I was feeding in prereqs for bcachefs was that patches just for bcachefs should go with the bcachefs pull request. 
> And there clearly is something very strange going on with superblock > handling This deserves an explanation because sget() is a bit nutty. The way sget() is conventionally used for block device filesystems, the block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used, but the holder is the fs type pointer, so it won't exclude with other opens of the same fs type. That means the only protection from multiple opens scribbling over each other is sget() itself - but if the bdev handle ever outlives the superblock we're completely screwed; that's a silent data corruption bug that we can't easily catch, and if the filesystem teardown path has any asynchronous stuff going on (and of course it does) that's not a hard mistake to make. I've observed at least one bug that looked suspiciously like that, but I don't think I quite pinned it down at the time. It also forces the caller to separate opening of the block devices from the rest of filesystem initialization, which is a bit less than ideal. Anyways, bcachefs just wants to be able to do real exclusive opens of the block devices, and we do all filesystem bringup with a single bch2_fs_open() call. I think this could be made to work with the way sget() wants to work, but it'd require reworking the locking in sget() - it does everything, including the test() and set() calls, under a single spinlock. > and the whole crazy discussion about fput being delayed. It > is what it is, and the patches I saw in this thread to not delay them > were bad. Jens claimed AIO was broken in the same way as io_uring, but it turned out that it's not - the test he posted was broken. And io_uring really is broken here. Look, the tests that are breaking because of this are important ones (generic/388 in particular), and those tests are no good to us if they're failing because of io_uring crap and Jens is throwing up his hands and saying "trainwreck!" when we try to get it fixed. > As to the actual prereqs: > > I'm not sure why 'd_mark_tmpfile()' didn't do the d_instantiate() that > everybody seems to want, but it looks fine to me. Maybe just because > Kent wanted the "mark" semantics for the naming. Fine. Originally, we were doing d_instantiate() separately, in common code, and the d_mark_tmpfile() was separate. Looking over the code again that would still be a reasonable approach, so I'd keep it that way. > The bio stuff should preferably go through Jens, or at least at a > minimum be acked. So, the block layer patches have been out on the list and been discussed, and they got an "OK" from Jens - https://lore.kernel.org/linux-fsdevel/aeb2690c-4f0a-003d-ba8b-fe06cd4142d1@kernel.dk/ But that's a little ambiguous - Jens, what do you want to do with those patches? I can re-send them to you now if you want to take them through your tree, or an ack would be great. > The '.faults_disabled_mapping' thing is a bit odd, but I don't hate > it, and I could imagine that other filesystems could possibly use that > approach instead of the current 'pagefault_disable/enable' games and > ->nofault games to avoid the whole "use mmap to have the source and > the destination of a write be the same page" thing. > > So as things stand now, the stuff outside bcachefs itself I don't find > objectionable. 
> > The stuff _inside_ bcachefs I care about only in the sense that I > really *really* would like a locking person to look at the six locks, > but at the same time as long as it's purely internal to bcachefs and > doesn't possibly affect anything else, I'm not *too* worried about > what I see. > > The thing that actually bothers me most about this all is the personal > arguments I saw. That I don't know what to do about. I don't actually > want to merge this over the objections of Christian, now that we have > a responsible vfs maintainer. I don't want to do that to Christian either; I think highly of the work he's been doing and I don't want to be adding to his frustration. So I apologize for losing my cool earlier; a lot of that was frustration from other threads spilling over. But: if he's going to be raising objections, I need to know what his concerns are if we're going to get anywhere. Raising objections without saying what the concerns are shuts down discussion; I don't think it's unreasonable to ask people not to do that, and to try and stay focused on the code. He's got an open invite to the bcachefs meeting, and we were scheduled to talk Tuesday but he was out sick - anyways, I'm looking forward to hearing what he has to say. More broadly, it would make me really happy if we could get certain people to take a more constructive, "what do we really care about here and how do we move forward" attitude instead of turning every interaction into an opportunity to dig their heels in on process and throw up barriers. That burns people out, fast. And it's getting to be a problem in -fsdevel land; I've lost count of the times I've heard Eric Sandeen complain about how impossible it is to get things merged, and I _really_ hope people are taking notice about Darrick stepping away from XFS and asking themselves what needs to be sorted out. Darrick writes meticulous, well documented code; when I think of people who slip in hacks other people are going to regret later, he's not one of them. And yet, online fsck for XFS has been pushed back repeatedly because of petty bullshit. Scaling laws being what they are, that's a feature we're going to need, and more importantly XFS cannot afford to lose more people - especially Darrick. To speak a bit to what's been driving _me_ a bit nuts in these discussions, top of my list is that the guy who's been the most obstinate and argumentative _to this day_ refuses to CC me when touching code I wrote - and as a result we've had some really nasty bugs (memory corruption, _silent data corruption_). So that really needs to change. Let's just please have a little more focus on not eating people's data, and being more responsible about bugs. Anyways, I just want to write the best code I can. That's all I care about, and I'm always happy to interact with people who share that goal. Cheers, Kent
On Thu, 10 Aug 2023 at 08:55, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > Heh, I liked the bitfields - I prefer that to open coding structs, which > is a major pet peeve of mine. But the text size went down a lot a lot > without them (would like to know why the compiler couldn't constant fold > all that stuff out, but... not enough to bother). Bitfields are actually absolutely *horrible* for many many reasons. The bit ordering being undefined is just one of them. Yes, they are convenient syntax, but the downsides of them mean that you should basically only use them for code that has absolutely zero subtle issues. Avoid them like the plague with any kind of "data transfer issues", so in the kernel avoid using them for user interfaces unless you are willing to deal with the duplication and horror of __LITTLE_ENDIAN_BITFIELD etc. Also avoid them if there is any chance of "raw format" issues: either saving binary data formats, or - as in your original code - using unions. As I pointed out your code was actively buggy, because you thought it was little-endian. That's not even true on little-endian machines (ie POWERPC is big-endian in bitfields, even when being little-endian in bytes!). Finally, as you found out, it also easily generates horrid code. It's just _harder_ for compilers to do the right thing, particularly when it's not obvious that other parts of the structure may be "don't care" because they got initialized earlier (or will be initialized later). Together with threading requirements, compilers might do a bad job either because of the complexity, or simply because of subtle consistency rules. End result: by all means use bitfields for the *simple* cases where they are used purely for internal C code with no form of external exposure, but be aware that even then the syntax convenience easily comes at a cost. > > On x86, you'd never see that as an issue, since all writes are > > releases, so the 'barrier()' compiler ordering ends up forcing the > > right magic. > > Yep, agreed. But you should realize that on other architectures, I think that "barrier() + plain write" is actively buggy. On x86 it's safe, but on arm (and in fact pretty much anything but s390), the barrier() does nothing in SMP. Yes, it constrains the compiler, but the earlier writes to remove the entry from the list may happen *later* as far as other CPUs are concerned. Which can be a huge problem if the "struct six_lock_waiter" is on the stack - which I assume it is - and the waiter is just spinning on w->lock_acquired. The data structure may be re-used as regular stack space by the time the list removal code happens. Debugging things like that is a nightmare. And you'll never see it on x86, and it doesn't look possible when looking at the code, and the oopses on other architectures will be completely random stack corruption some time after it got the lock. So this is kind of why I worry about locking. It's really easy to write code that works 99.9% of the time, but then breaks when you are unlucky. And depending on the pattern, the "when you are unlucky" may or may not be possible on x86. It's not like x86 has total memory ordering either, it's just stronger than most. > > Some of the other oddity is around the this_cpu ops, but I suspect > > that is at least partly then because we don't have acquire/release > > versions of the local cpu ops that the code looks like it would want. > > You mean using full barriers where acquire/release would be sufficient? Yes. 
That code looks like it should work, but be hugely less efficient than it might be. "smp_mb()" tends to be expensive everywhere, even x86. Of course, I might be missing some other cases. That percpu reader queue worries me a bit just because it ends up generating ordering based on two different things - the lock word _and_ the percpu word. And I get very nervous if the final "this gets the lock" isn't some obvious "try_cmpxchg_acquire()" or similar, just because we've historically had *so* many very subtle bugs in just about every single lock we've ever had. > Matthew was planning on sending the iov_iter patch to you - right around > now, I believe, as a bugfix, since right now > copy_page_from_iter_atomic() silently does crazy things if you pass it a > compound page. > > Block layer patches aside, are there any _others_ you really want to go > via maintainers? It was mainly just the iov and the block layer. The superblock cases I really don't understand why you insist on just being different from everybody else. Your exclusivity arguments make no sense to me. Just open the damn thing. No other filesystem has ever had the fundamental problems you describe. You can do any exclusivity test you want in the "test()/set()" functions passed to sget(). You say that it's a problem because of a "single spinlock", but it hasn't been a problem for anybody else. I don't understand why you are so special. The whole problem seems made-up. > More broadly, it would make me really happy if we could get certain > people to take a more constructive, "what do we really care about here > and how do we move forward" attitude instead of turning every > interaction into an opportunity to dig their heels in on process and > throw up barriers. Honestly, I think one huge problem here is that you've been working on this for a long time (what - a decade by now?) and you've made all these decisions that you explicitly wanted to be done independently and intentionally outside the kernel. And then you feel that "now it's ready to be included", and think that all those decisions you made outside of the mainline kernel now *have* to be done that way, and basically sent your first pull request as a fait-accompli. The six-locks showed some of that, but as long as they are bcachefs-internal, I don't much care. The sget() thing really just smells like "this is how I designed things, and that's it". Linus
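To make the barrier() point above concrete: a minimal, hedged sketch of the hand-off pattern being described, with invented names - this is not the actual six lock code. The waker unlinks the waiter and only then publishes the hand-off with a release store; the waiter spins with an acquire load and must not reuse its stack frame before that. barrier() alone would only constrain the compiler, not the CPU.

struct example_waiter {
        struct list_head        list;
        bool                    lock_acquired;
};

/* waker: unlink the waiter, then publish the hand-off */
static void example_hand_off(struct example_waiter *w)
{
        list_del(&w->list);
        /* the list removal above must be visible before this store */
        smp_store_release(&w->lock_acquired, true);
}

/* waiter: w lives on our stack; don't return until the hand-off is visible */
static void example_wait_for_lock(struct example_waiter *w)
{
        while (!smp_load_acquire(&w->lock_acquired))
                cpu_relax();
        /* only now is it safe to let this stack frame be reused */
}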
On Thu 10-08-23 11:54:53, Kent Overstreet wrote: > > And there clearly is something very strange going on with superblock > > handling > > This deserves an explanation because sget() is a bit nutty. > > The way sget() is conventionally used for block device filesystems, the > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used, > but the holder is the fs type pointer, so it won't exclude with other > opens of the same fs type. > > That means the only protection from multiple opens scribbling over each > other is sget() itself - but if the bdev handle ever outlives the > superblock we're completely screwed; that's a silent data corruption bug > that we can't easily catch, and if the filesystem teardown path has any > asynchronous stuff going on (and of course it does) that's not a hard > mistake to make. I've observed at least one bug that looked suspiciously > like that, but I don't think I quite pinned it down at the time. This is just being changed - check Christian's VFS tree. There are patches that make sget() use superblock pointer as a bdev holder so the reuse you're speaking about isn't a problem anymore. > It also forces the caller to separate opening of the block devices from > the rest of filesystem initialization, which is a bit less than ideal. > > Anyways, bcachefs just wants to be able to do real exclusive opens of > the block devices, and we do all filesystem bringup with a single > bch2_fs_open() call. I think this could be made to work with the way > sget() wants to work, but it'd require reworking the locking in > sget() - it does everything, including the test() and set() calls, under > a single spinlock. Yeah. Maybe the current upstream changes aren't enough to make your life easier for bcachefs, btrfs does its special thing as well after all because mount also involves multiple devices for it. I just wanted to mention that the exclusive bdev open thing is changing. Honza
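For reference, a hedged sketch of the conventional pattern under discussion - roughly what mount_bdev() has done, simplified and with invented names, not the exact upstream code. The test()/set() callbacks run under sget()'s internal lock, which is why they only compare and stash pointers while the block device open happens outside them:

static int example_test_bdev_super(struct super_block *s, void *data)
{
        /* is there already a superblock for this block device? */
        return s->s_bdev == data;
}

static int example_set_bdev_super(struct super_block *s, void *data)
{
        s->s_bdev = data;
        s->s_dev = s->s_bdev->bd_dev;
        return 0;
}

/* in the mount path, after the (not-really-exclusive) bdev open:
 *
 *      sb = sget(fs_type, example_test_bdev_super,
 *                example_set_bdev_super, flags | SB_NOSEC, bdev);
 */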
On Thu, Aug 10, 2023 at 09:40:08AM -0700, Linus Torvalds wrote: > > > Some of the other oddity is around the this_cpu ops, but I suspect > > > that is at least partly then because we don't have acquire/release > > > versions of the local cpu ops that the code looks like it would want. > > > > You mean using full barriers where acquire/release would be sufficient? > > Yes. > > That code looks like it should work, but be hugely less efficient than > it might be. "smp_mb()" tends to be expensive everywhere, even x86. do_six_unlock_type() doesn't need a full barrier, but I'm not sure we can avoid the one in __do_six_trylock(), in the percpu reader path. > Of course, I might be missing some other cases. That percpu reader > queue worries me a bit just because it ends up generating ordering > based on two different things - the lock word _and_ the percpu word. > > And I get very nervous if the final "this gets the lock" isn't some > obvious "try_cmpxchg_acquire()" or similar, just because we've > historically had *so* many very subtle bugs in just about every single > lock we've ever had. kernel/locking/percpu-rwsem.c uses the same idea. The difference is that percpu-rwsem avoids the memory barrier on the read side in the fast path at the cost of requiring an rcu barrier on the write side... and all the craziness that entails. But __percpu_down_read_trylock() uses the same algorithm I'm using, including the same smp_mb(): we need to ensure that the read of the lock state happens after the store to the percpu read count, and I don't know how to do that without a smp_mb() - smp_store_acquire() isn't a thing. > > Matthew was planning on sending the iov_iter patch to you - right around > > now, I believe, as a bugfix, since right now > > copy_page_from_iter_atomic() silently does crazy things if you pass it a > > compound page. > > > > Block layer patches aside, are there any _others_ you really want to go > > via maintainers? > > It was mainly just the iov and the block layer. > > The superblock cases I really don't understand why you insist on just > being different from everybody else. > > Your exclusivity arguments make no sense to me. Just open the damn > thing. No other filesystem has ever had the fundamental problems you > describe. You can do any exclusivity test you want in the > "test()/set()" functions passed to sget(). When using sget() in the conventional way it's not possible for FMODE_EXCL to protect against concurrent opens scribbling over each other because we open the block device before checking if it's already mounted, and we expect that open to succeed. > You say that it's a problem because of a "single spinlock", but it > hasn't been a problem for anybody else. The spinlock means you can't do the actual open in set(), which is why the block device has to be opened in not-really-exclusive mode. I think it'd be possible to change the locking in sget() so that the set() callback could do the open, but I haven't looked closely at it. > and basically sent your first pull request as a fait-accompli. When did I ever do that?
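A hedged sketch of the reader fast path ordering in question, modelled loosely on __percpu_down_read_trylock() with invented names - not the actual six lock or percpu-rwsem code. The store to the per-CPU count has to be ordered before the load of the writer state, and that store->load ordering is exactly what acquire/release cannot provide, hence the smp_mb():

struct example_percpu_lock {
        unsigned int __percpu   *read_count;
        atomic_t                writer_locked;
};

static bool example_read_trylock(struct example_percpu_lock *lock)
{
        this_cpu_inc(*lock->read_count);        /* A: announce this reader */

        /* order A before B: a store followed by a load needs a full barrier */
        smp_mb();

        if (likely(!atomic_read(&lock->writer_locked)))  /* B: any writer? */
                return true;

        this_cpu_dec(*lock->read_count);        /* lost the race, fall back to the slow path */
        return false;
}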
On Thu, 10 Aug 2023 at 11:02, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > When using sget() in the conventional way it's not possible for > FMODE_EXCL to protect against concurrent opens scribbling over each > other because we open the block device before checking if it's already > mounted, and we expect that open to succeed. So? Read-only operations. Don't write to anything until after you then have verified your exclusive status. If you think you need to be exclusive to other people opening the device for other things, just stop expecting to control the whole world. Linus
On Thu, Aug 10, 2023 at 11:54:53AM -0400, Kent Overstreet wrote: > Adding Jens to the CC: <snip to the parts I care most about> > > and the whole crazy discussion about fput being delayed. It > > is what it is, and the patches I saw in this thread to not delay them > > were bad. > > Jens claimed AIO was broken in the same way as io_uring, but it turned > out that it's not - the test he posted was broken. > > And io_uring really is broken here. Look, the tests that are breaking > because of this are important ones (generic/388 in particular), and > those tests are no good to us if they're failing because of io_uring > crap and Jens is throwing up his hands and saying "trainwreck!" when we > try to get it fixed. FWIW I recently fixed all my stupid debian package dependencies so that I could actually install liburing again, and rebuilt fstests. The very next morning I noticed a number of new test failures in /exactly/ the way that Kent said to expect: fsstress -d /mnt & <sleep then simulate fs crash>; \ umount /mnt; mount /dev/sda /mnt Here, umount exits before the filesystem is really torn down, and then mount fails because it can't get an exclusive lock on the device. As a result, I can't test crash recovery or corrupted metadata shutdowns because of this delayed fput thing or whatever. It all worked before (even with libaio in use) I turned on io_uring. Obviously, I "fixed" this by modifying fsstress to require explicit enabling of io_uring operations; everything went back to green after that. I'm not familiar enough with the kernel side of io_uring to know what the solution here is; I'm merely here to provide a second data point. <snip again> > > The thing that actually bothers me most about this all is the personal > > arguments I saw. That I don't know what to do about. I don't actually > > want to merge this over the objections of Christian, now that we have > > a responsible vfs maintainer. > > I don't want to do that to Christian either, I think highly of the work > he's been doing and I don't want to be adding to his frustration. So I > apologize for loosing my cool earlier; a lot of that was frustration > from other threads spilling over. > > But: if he's going to be raising objections, I need to know what his > concerns are if we're going to get anywhere. Raising objections without > saying what the concerns are shuts down discussion; I don't think it's > unreasonable to ask people not to do that, and to try and stay focused > on the code. Yeah, I'm also really happy that we have a new/second VFS maintainer. I figure it's going to take us a while to help Christian to get past his fear and horror at the things lurking in fs/ but that's something worth doing. (I'm not presuming to know what Christian feels about the VFS; 'fear and horror' is what *I* feel every time I have to go digging down there. I'm extrapolating about what I would need, were I a new maintainer, to get myself to the point where I would have an open enough mind to engage with new or unfamiliar concepts so that a review cycle for something as big as bcachefs/online fsck/whatever would be productive.) > He's got an open invite to the bcachefs meeting, and we were scheduled > to talk Tuesday but he was out sick - anyways, I'm looking forward to > hearing what he has to say. 
> > More broadly, it would make me really happy if we could get certain > people to take a more constructive, "what do we really care about here > and how do we move forward" attitude ...and "what are all the supporting structures that we need to have in place to maximize the chances that we'll accomplish those goals"? > instead of turning every > interaction into an opportunity to dig their heels in on process and > throw up barriers. > > That burns people out, fast. And it's getting to be a problem in > -fsdevel land; Past-participle, not present. :/ I've said this previously, and I'll say it again: we're severely under-resourced. Not just XFS, the whole fsdevel community. As a developer and later a maintainer, I've learnt the hard way that a very large amount of non-coding work is necessary to build a good filesystem. There's enough not-really-coding work for several people. Instead, we lean hard on maintainers to do all that work. That might've worked acceptably for the first 20 years, but it doesn't now. Nowadays we have all these people running bots and AIs throwing a steady stream of bug reports and CVE reports at Dave [Chinner] and me. Most of these people *do not* help fix the problems they report. Once in a while there's an actual *user* report about data loss, but those (thankfully) aren't the majority of the reports. However, every one of these reports has to be triaged, analyzed, and dealt with. As soon as we clear one, at least one more rolls in. You know what that means? Dave and I are both in a permanent state of heightened alert, fear, and stress. We never get to settle back down to calm. Every time someone brings up syzbot, CVEs, or security? I feel my own stress response ramping up. I can no longer have "rational" conversations about syzbot because those discussions push my buttons. This is not healthy! Add to that the many demands to backport this and that to dozens of LTS kernels and distro kernels. Why do the participation modes for that seem to be (a) take on an immense amount of backporting work that you didn't ask for; or (b) let a non-public ML thing pick patches and get yelled at when it does the wrong thing? Nobody ever asked me if I thought the XFS community could support such-and-such LTS kernel. As the final insult, other people pile on by offering useless opinions about the maintainers being far behind and unhelpful suggestions that we engage in a major codebase rewrite. None of this is helpful. Dave and I are both burned out. I'm not sure Dave ever got past the 2017 burnout that led to his resignation. Remarkably, he's still around. Is this (extended burnout) where I want to be in 2024? 2030? Hell no. I still have enough left that I want to help ourselves adapt our culture to solve these problems. I tried to get the conversation started with the maintainer entry profile for XFS that I recently submitted, but that alone cannot be the final product: https://lore.kernel.org/linux-xfs/169116629797.3243794.7024231508559123519.stgit@frogsfrogsfrogs/T/#m74bac05414cfba214f5cfa24a0b1e940135e0bed Being a maintainer feels like a punishment, and that cannot stand. We need help. People see the kinds of interpersonal interactions going on here and decide to pursue any other career path. I know so; some have told me themselves. You know what's really sad? Most of my friends work for small companies, nonprofits, and local governments.
They report the same problems with overwork, pervasive fear and anger, and struggle to understand and adapt to new ideas that I observe here. They see the direct connection between their org's lack of revenue and the under resourcedness. They /don't/ understand why the hell the same happens to me and my workplace proximity associates, when we all work for companies that each clear hundreds of billions of dollars in revenue per year. (Well, they do understand: GREED. They don't get why we put up with this situation, or why we don't advocate louder for making things better.) > I've lost count of the times I've heard Eric Sandeen > complain about how impossible it is to get things merge, A group dynamic that I keep observing around here is that someone tries to introduce some unfamiliar (or even slightly new) concept, because they want the kernel to do something it didn't do before. The author sends out patches for review, and some of the reviewers who show up sound like they're so afraid of ... something ... that they throw out vague arguments that something might break. [I have had people tell me in private that while they don't have any specific complaints about online fsck, "something" is wrong and I need to stop and consider more thoroughly. Consider /what/?] Or, worse, no reviewers show up. The author merges it, and a month later there's a freakout because something somewhere else broke. Angry threads spread around fsdevel because now there's pressure to get it fixed before -rc8 (in the good case) or ASAP (because now it's released). Did the author have an incomplete understanding of the code? Were there potential reviewers who might've said something but bailed? Yes and yes. What do we need to reduce the amount of fear and anger around here, anyway? 20 years ago when I started my career in Linux I found the work to be challenging and enjoyable. Now I see a lot more anger, and I am sad, because there /are/ still enjoyable challenges to be undertaken. Can we please have that conversation? People and groups do not do well when they feel like they're under constant attack, like they have to brace themselves for whatever bullshit is coming next. That is how I feel most weeks, and I choose not to do that anymore. > and I _really_ > hope people are taking notice about Darrick stepping away from XFS and > asking themselves what needs to be sorted out. Me too. Ted expressed similar laments about ext4 after I announced my intention to reduce my own commitments to XFS. > Darrick writes > meticulous, well documented code; when I think of people who slip by > hacks other people are going to regret later, he's not one of them. I appreciate the compliment. ;) From what I can tell (because I lolquit and finally had time to start scanning the bcachefs code) I really like the thought that you've put into indexing and record iteration in the filesystem. I appreciate the amount of work you've put into making it easy and fast to run QA on bcachefs, even if we don't quite agree on whether or not I should rip and replace my 20yo Debian crazyquilt. > And yet, online fsck for XFS has been pushed back repeatedly because > of petty bullshit. A broader dynamic here is that I ask people to review the code so that I can merge it; they say they will do it; and then an entire cycle goes by without any visible progress. 
When I ask these people why they didn't follow through on their commitments, the responses I hear are pretty uniform -- they got buried in root cause analysis of a real bug report but lol there were no other senior people available; their time ended up being spent on backports or arguing about backports; or they got caught up in that whole freakout thing I described above. > Scaling laws being what they are, that's a feature we're going to need, > and more importantly XFS cannot afford to lose more people - especially > Darrick. While I was maintainer I lobbied managers at Oracle and Google and RH to hire new people to grow the size of the XFS community, and they did. That was awesome! It's not so hard to help managers come up with business justifications for headcount for critical pieces of their products*. But. For 2023 XFS is already down 2 people + whatever the hell I was doing that isn't "trying to get online fsck merged". We're still at +1, but still who's going to replace us oldtimers? --D * But f*** impossible to get that done when it's someone's 20% project causing a lot of friction on the mailing lists. > To speak a bit to what's been driving _me_ a bit nuts in these > discussions, top of my list is that the guy who's been the most > obstinate and argumentative _to this day_ refuses to CC me when touching > code I wrote - and as a result we've had some really nasty bugs (memory > corruption, _silent data corruption_). > > So that really needs to change. Let's just please have a little more > focus on not eating people's data, and being more responsible about > bugs. > > Anyways, I just want to write the best code I can. That's all I care > about, and I'm always happy to interact with people who share that goal. > > Cheers, > Kent
On Thu, Aug 10, 2023 at 11:54:53AM -0400, Kent Overstreet wrote: > Matthew was planning on sending the iov_iter patch to you - right around > now, I believe, as a bugfix, since right now > copy_page_from_iter_atomic() silently does crazy things if you pass it a > compound page. That's currently sitting in Darrick's iomap tree, commit 908a1ad89466 "iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()" It's based on 6.5-rc3, so it would be entirely possible for Darrick to send Linus a pull request for 908a1ad89466 ... or you could pull in 908a1ad89466. I'll talk to Darrick tomorrow.
On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@kernel.org> wrote: > > FWIW I recently fixed all my stupid debian package dependencies so that > I could actually install liburing again, and rebuilt fstests. The very > next morning I noticed a number of new test failures in /exactly/ the > way that Kent said to expect: > > fsstress -d /mnt & <sleep then simulate fs crash>; \ > umount /mnt; mount /dev/sda /mnt > > Here, umount exits before the filesystem is really torn down, and then > mount fails because it can't get an exclusive lock on the device. I agree that that obviously sounds like mount is just returning either too early. Or too eagerly. But I suspect any delayed fput() issues (whether from aio or io_uring) are then just a way to trigger the problem, not the fundamental cause. Because even if the fput() is delayed, the mntput() part of that delayed __fput action is the one that *should* have kept the filesystem mounted until it is no longer busy. And more importantly, having some of the common paths synchronize *their* fput() calls only affects those paths. It doesn't affect the fundamental issue that the last fput() can happen in odd contexts when the file descriptor was used for something a bit stranger. So I do feel like the fput patch I saw looked more like a "hide the problem" than a real fix. Put another way: I would not be surprised in the *least* if then adding more synchronization to fput would basically hide any issue, particularly from tests that then use those things that you added synchronization for. But it really smells like it's purely hiding the symptom to me. If I were a betting man, I'd look at ->mnt_count. I'm not saying that's the problem, but the mnt refcount handling is more than a bit scary. It is so hugely performance-critical (ie every single path access) that we use those percpu counters for it, and I'm not at all sure it's all race-free. Just as an example, mnt_add_count() has this comment above it: * vfsmount lock must be held for read but afaik the main way it gets called is through mntget(), and I see no vfsmount lock held anywhere there (think "path_get()" and friends). Maybe I'm missing something obvious. So I think that comment is some historical left-over that hasn't been true for ages. And all of the counter updates should be consistent even in the absence of said lock, so it's not an issue. Except when it is: it does look like it *would* screw up mnt_get_count() that tries to add up all those percpu counters with for_each_possible_cpu(cpu) { count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count; } and that one has that * vfsmount lock must be held for write comment, which makes sense as a "that would indeed synchronize if others held it for read". But... And where is that sum used? Very much in things like may_umount_tree(). Anyway, I'm absolutely not saying this is the actual problem - we probably at least partly just have stale or incomplete comments, and maybe I think the fput() side is good mainly because I'm *much* more familiar with that side than I am with the actual mount code these days. So I might be barking up entirely the wrong tree. But I do feel like the fput patch I saw looked more like a "hide the problem" than a real fix. Because the mount counting *should* be entirely independent of when exactly a fput happens.
So I believe having the test-case then do some common fput's synchronously pretty much by definition can't fix any issues, but it *can* make sure that any normal test using just regular system calls then never triggers the "oh, in other situations the last fput will be delayed". So that's why I'm very leery of the fput patch I saw. I don't think it makes sense. That does *not* mean that I don't believe that umount can have serious problems. I suspect we get very little coverage of that in normal situations. And yes, obviously io_uring does add *way* more asynchronicity, and I'd not be surprised at all if it uncovers problems. In most other situations, the main source of file counts is purely open/close system calls, which are in many ways "simple" (and where things like process exit obviously does the closing part). Linus
On 8/10/23 5:47 PM, Linus Torvalds wrote: > On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@kernel.org> wrote: >> >> FWIW I recently fixed all my stupid debian package dependencies so that >> I could actually install liburing again, and rebuilt fstests. The very >> next morning I noticed a number of new test failures in /exactly/ the >> way that Kent said to expect: >> >> fsstress -d /mnt & <sleep then simulate fs crash>; \ >> umount /mnt; mount /dev/sda /mnt >> >> Here, umount exits before the filesystem is really torn down, and then >> mount fails because it can't get an exclusive lock on the device. > > I agree that that obviously sounds like mount is just returning either > too early. Or too eagerly. > > But I suspect any delayed fput() issues (whether from aio or io_uring) > are then just a way to trigger the problem, not the fundamental cause. > > Because even if the fput() is delayed, the mntput() part of that > delayed __fput action is the one that *should* have kept the > filesystem mounted until it is no longer busy. > > And more importantly, having some of the common paths synchronize > *their* fput() calls only affects those paths. > > It doesn't affect the fundamental issue that the last fput() can > happen in odd contexts when the file descriptor was used for something > a bit stranger. > > So I do feel like the fput patch I saw looked more like a "hide the > problem" than a real fix. The fput patch was not pretty, nor is it needed. What happens on the io_uring side is that pending requests (which can hold files referenced) are canceled on exit. But we don't wait for the references to go away, which then introduces this race. I've used this to trigger it: #!/bin/bash DEV=/dev/nvme0n1 MNT=/data ITER=0 while true; do echo loop $ITER sudo mount $DEV $MNT fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null & Y=$(($RANDOM % 3)) X=$(($RANDOM % 10)) VAL="$Y.$X" sleep $VAL ps -e | grep fio > /dev/null 2>&1 while [ $? -eq 0 ]; do killall -9 fio > /dev/null 2>&1 wait > /dev/null 2>&1 ps -e | grep "fio " > /dev/null 2>&1 done sudo umount /data if [ $? -ne 0 ]; then break fi ((ITER++)) done and can make it happen pretty easily, within a few iterations. Contrary to how it was otherwise presented in this thread, I did take a look at this a month ago and wrote up some patches for it. Just rebased them on the current tree: https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel Since we have task_work involved for both the completions and the __fput(), ordering is a concern which is why it needs a bit more effort than just the bare bones stuff. The way the task_work list works, we llist_del_all() and run all items. But we do encapsulate that in io_uring anyway, so it's possible to run our pending local items and avoid that particular snag. WIP obviously, the first 3-4 prep patches were posted earlier today, but I'm not happy with the last 3 yet in the above branch. Or at least not fully confident, so will need a bit more thinking and testing. Does pass the above test case, and the regular liburing test/regression cases, though.
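The shape of that fix - cancel, then wait for every cancelled request to drop the file reference it holds before exit returns - is roughly the following generic, hedged sketch. Names are invented, this is not the actual io_uring code, and it glosses over the task_work ordering concern mentioned above:

struct example_ring_ctx {
        atomic_t                inflight;
        wait_queue_head_t       wait;
};

/* completion/cancellation path: drop the request's file reference,
 * then wake anyone waiting for the ring to go idle */
static void example_req_put(struct example_ring_ctx *ctx, struct file *file)
{
        fput(file);
        if (atomic_dec_and_test(&ctx->inflight))
                wake_up(&ctx->wait);
}

/* exit path: cancellations have been issued; don't return until every
 * outstanding request has dropped its references */
static void example_exit_wait(struct example_ring_ctx *ctx)
{
        wait_event(ctx->wait, !atomic_read(&ctx->inflight));
}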
On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote: > On Thu 10-08-23 11:54:53, Kent Overstreet wrote: > > > And there clearly is something very strange going on with superblock > > > handling > > > > This deserves an explanation because sget() is a bit nutty. > > > > The way sget() is conventionally used for block device filesystems, the > > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used, > > but the holder is the fs type pointer, so it won't exclude with other > > opens of the same fs type. > > > > That means the only protection from multiple opens scribbling over each > > other is sget() itself - but if the bdev handle ever outlives the > > superblock we're completely screwed; that's a silent data corruption bug > > that we can't easily catch, and if the filesystem teardown path has any > > asynchronous stuff going on (and of course it does) that's not a hard > > mistake to make. I've observed at least one bug that looked suspiciously > > like that, but I don't think I quite pinned it down at the time. > > This is just being changed - check Christian's VFS tree. There are patches > that make sget() use superblock pointer as a bdev holder so the reuse > you're speaking about isn't a problem anymore. So then the question is what do you use for identifying the superblock, and you're switching to the dev_t - interesting. Are we 100% sure that will never break, that a dev_t will always identify a unique block_device? Namespacing has been changing things. > > It also forces the caller to separate opening of the block devices from > > the rest of filesystem initialization, which is a bit less than ideal. > > > > Anyways, bcachefs just wants to be able to do real exclusive opens of > > the block devices, and we do all filesystem bringup with a single > > bch2_fs_open() call. I think this could be made to work with the way > > sget() wants to work, but it'd require reworking the locking in > > sget() - it does everything, including the test() and set() calls, under > > a single spinlock. > > Yeah. Maybe the current upstream changes aren't enough to make your life > easier for bcachefs, btrfs does its special thing as well after all because > mount also involves multiple devices for it. I just wanted to mention that > the exclusive bdev open thing is changing. I like the mount_bdev() approach in your patch a _lot_ better than the old code, I think the approach almost works for multi device filesystems - at least for bcachefs where we always pass in the full list of devices we want to open, there's no kernel side probing like in btrfs. What changes is we'd have to pass a vector of dev_t's to sget(), and set() needs to be able to stash them in super_block (not s_fs_info, we need that for bch_fs later and that doesn't exist yet). But that's a minor detail. Yeah, this could work.
On Thu, Aug 10, 2023 at 03:39:42PM -0700, Darrick J. Wong wrote: > I've said this previously, and I'll say it again: we're severely > under-resourced. Not just XFS, the whole fsdevel community. As a > developer and later a maintainer, I've learnt the hard way that there is > a very large amount of non-coding work is necessary to build a good > filesystem. There's enough not-really-coding work for several people. > Instead, we lean hard on maintainers to do all that work. That might've > worked acceptably for the first 20 years, but it doesn't now. Yeah, that was my takeaway too when I started doing some more travelling last fall to talk to people about bcachefs - the teams are not what they were 10 years ago, and a lot of the effort in the filesystem space feels a lot more fragmented. It feels like there's a real lack of leadership or any kind of a long term plan in the filesystem space, and I think that's one of the causes of all the burnout; we don't have a clear set of priorities or long term goals. > Nowadays we have all these people running bots and AIs throwing a steady > stream of bug reports and CVE reports at Dave [Chinner] and I. Most of > these people *do not* help fix the problems they report. Once in a > while there's an actual *user* report about data loss, but those > (thankfully) aren't the majority of the reports. > > However, every one of these reports has to be triaged, analyzed, and > dealt with. As soon as we clear one, at least one more rolls in. You > know what that means? Dave and I are both in a permanent state of > heightened alert, fear, and stress. We never get to settle back down to > calm. Every time someone brings up syzbot, CVEs, or security? I feel > my own stress response ramping up. I can no longer have "rational" > conversations about syzbot because those discussions push my buttons. > > This is not healthy! Yeah, we really need to take a step back and ask ourselves what we're trying to do here. At this point, I'm not so sure hardening xfs/ext4 in all the ways people are wanting them to be hardened is a realistic idea: these are huge, old C codebases that are tricky to work on, and they weren't designed from the start with these kinds of considerations. Yes, in a perfect world all code should be secure and all bugs should be fixed, but is this the way to do it? Personally, I think we'd be better served by putting what manpower we can spare into starting on an incremental Rust rewrite; at least that's my plan for bcachefs, and something I've been studying for awhile (as soon as the gcc rust stuff lands I'll be adding Rust code to fs/bcachefs, some code already exists). For xfs/ext4, teasing things apart and figuring out how to restructure data structures in a way to pass the borrow checker may not be realistic, I don't know the codebases well enough to say - but clearly the current approach is not working, and these codebases are almost definitely still going to be in use 50 years from now, we need to be coming up with _some_ sort of plan. And if we had a coherent long term plan, maybe that would help with the funding and headcount issues... > A group dynamic that I keep observing around here is that someone tries > to introduce some unfamiliar (or even slightly new) concept, because > they want the kernel to do something it didn't do before. The author > sends out patches for review, and some of the reviewers who show up > sound like they're so afraid of ... something ... that they throw out > vague arguments that something might break. 
> > [I have had people tell me in private that while they don't have any > specific complaints about online fsck, "something" is wrong and I need > to stop and consider more thoroughly. Consider /what/?] Yup, that's just broken. If you're telling someone they're doing it wrong and you're not offering up any ideas, maybe _you're_ the problem. The fear based thing is very real, and _very_ understandable. In the filesystem world, we have to live with our mistakes in a way no one else in kernel land does. There's no worse feeling than realizing you fucked up something in the on disk format, and you didn't realize it until six months later, and now you've got incompatibilities that are a nightmare to sort out - never mind the more banal "oh fuck, sorry I ate your data" stories. > Or, worse, no reviewers show up. The author merges it, and a month > later there's a freakout because something somewhere else broke. Angry > threads spread around fsdevel because now there's pressure to get it > fixed before -rc8 (in the good case) or ASAP (because now it's > released). Did the author have an incomplete understanding of the code? > Were there potential reviewers who might've said something but bailed? > Yes and yes. > > What do we need to reduce the amount of fear and anger around here, > anyway? 20 years ago when I started my career in Linux I found the work > to be challenging and enjoyable. Now I see a lot more anger, and I am > sad, because there /are/ still enjoyable challenges to be undertaken. > Can we please have that conversation? I've been through the burnout cycle too (many times!), and for me the answer was: slow down, and identify the things that really matter, the things that will make my life easier in the long run, and focus on _that_. I've been through cycles more than once where I wasn't keeping up with bug reports, and I had to tell my users "hang on - this isn't efficient, I need to work on the testing automation because stuff is slipping through; give me a month". (And also make sure to leave some time for the things I actually do enjoy; right now that means working on the fuse port here and there). > People and groups do not do well when they feel like they're under > constant attack, like they have to brace themselves for whatever > bullshit is coming next. That is how I feel most weeks, and I choose > not to do that anymore. > > > and I _really_ > > hope people are taking notice about Darrick stepping away from XFS and > > asking themselves what needs to be sorted out. > > Me too. Ted expressed similar laments about ext4 after I announced my > intention to reduce my own commitments to XFS. Oh man, we can't lose Ted. > > Darrick writes > > meticulous, well documented code; when I think of people who slip by > > hacks other people are going to regret later, he's not one of them. > > I appreciate the compliment. ;) > > From what I can tell (because I lolquit and finally had time to start > scanning the bcachefs code) I really like the thought that you've put > into indexing and record iteration in the filesystem. I appreciate the > amount of work you've put into making it easy and fast to run QA on > bcachefs, even if we don't quite agree on whether or not I should rip > and replace my 20yo Debian crazyquilt. Thanks, the database layer is something I've put a _ton_ of work into. 
I feel like we're close to being able to get into some really exciting stuff once we get past the "stabilizing a new filesystem with a massive featureset" madness - people have been trying to do the filesystem-as-a-database thing for years, and I think bcachefs is the first to actually seriously pull it off. And I'm really hoping to make the test infrastructure its own real project for the whole fs community, and more. There's a lot of good stuff in there I just need to document better and create a proper website for. > > And yet, online fsck for XFS has been pushed back repeatedly because > > of petty bullshit. > > A broader dynamic here is that I ask people to review the code so that I > can merge it; they say they will do it; and then an entire cycle goes by > without any visible progress. > > When I ask these people why they didn't follow through on their > commitments, the responses I hear are pretty uniform -- they got buried > in root cause analysis of a real bug report but lol there were no other > senior people available; their time ended up being spent on backports or > arguing about backports; or they got caught up in that whole freakout > thing I described above. Yeah, that set of priorities makes sense when we're talking about patches that modify existing code; if you can't keep up with bug reports then you have to slow down on changes, and changes to existing code often do need the meticulous review - and hopefully while people are waiting on code review they'll be helping out with bug reports. But for new code that isn't going to upset existing users, if we trust the author to not do crazy things then code review is really more about making sure someone else understands the code. But if they're putting in all the proper effort to document, to organize things well, to do things responsibly, does it make sense for that level of code review to be an up front requirement? Perhaps we could think a _bit_ more about how we enable people to do good work. I'm sure the XFS people have thought about this more than I have, but given how long this has been taking you and the amount of pushback I feel it ought to be asked.
On Thu, Aug 10, 2023 at 04:47:22PM -0700, Linus Torvalds wrote:
> So I might be barking up entirely the wrong tree.
Yeah, I think you are, it sounds like you're describing an entirely
different sort of race.
The issue here is just that killing off a process should release all the
references it holds, and if we kill off all processes accessing a
filesystem we should be able to unmount it - but in this case we can't,
because fputs() are being delayed asynchronously.
delayed_fput() from AIO turned out to not be an issue in my testing, for
reasons that are unclear to me; flush_delayed_fput() certainly isn't
called in any relevant codepaths. The code _looks_ buggy to me, but I
wasn't able to trigger the bug with AIO.
io_uring adds its own layer of indirect asynchronous reference holding,
and that's why the issue crops up there - but io_uring isn't using
delayed_fput() either.
The patch I posted was to make sure the file ref doesn't outlive the
task - I honestly don't know what you and Jens don't like about that
approach (obviously, adding task->ref gets and puts to fastpaths is a
nonstarter, but that's fixable as mentioned).
On Thu, 10 Aug 2023 at 21:03, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, Aug 10, 2023 at 04:47:22PM -0700, Linus Torvalds wrote: > > So I might be barking up entirely the wrong tree. > > Yeah, I think you are, it sounds like you're describing an entirely > different sort of race. I was just going by Darrick's description of what he saw, which *seemed* to be that umount had finished with stuff still active: "Here, umount exits before the filesystem is really torn down, and then mount fails because it can't get an exclusive lock on the device." But maybe I misunderstood, and the umount wasn't actually successful (ie "exits" may have been "failed with EBUSY")? So I was trying to figure out what could cause the behavior I thought Darrick was describing, which would imply a mnt_count issue. If it's purely "umount doesn't succeed because the filesystem is still busy with cleanups", then things are much better. The mnt_count case is nasty; if it's not that, we're actually much better off, and I'll be very happy to have misunderstood Darrick's case. Linus
On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote: > If it's purely "umount doesnt' succeed because the filesystem is still > busy with cleanups", then things are much better. That's exactly it. We have various tests that kill -9 fio and then umount, and umount spuriously fails.
On Thu, 10 Aug 2023 at 22:29, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote: > > If it's purely "umount doesnt' succeed because the filesystem is still > > busy with cleanups", then things are much better. > > That's exactly it. We have various tests that kill -9 fio and then > umount, and umount spuriously fails. Well, it sounds like Jens already has some handle on at least one io_uring shutdown case that didn't wait for completion. At the same time, a random -EBUSY is kind of an expected failure in real life, since outside of strictly controlled environments you could easily have just some entirely unrelated thing that just happens to have looked at the filesystem when you tried to unmount it. So any real-life use tends to use umount in a (limited) loop. It might just make sense for the fsstress test scripts to do the same regardless. There's no actual good reason to think that -EBUSY is a hard error. It very much can be transient. In fact, I have this horrible flash-back memory to some auto-expiry scripts that used to do the equivalent of "umount -a -t autofs" every minute or so as a horrible model for expiring things, happy and secure in the knowledge that if the filesystem was still in active use, it would just fail. So may I suggest that even if the immediate issue ends up being sorted out, just from a robustness standpoint the "consider EBUSY a hard error" seems to be a mistake. Transient failures are pretty much expected, and not all of them are necessarily kernel-related (ie think "concurrent updatedb run" or any number of other possibilities). Linus
> So may I suggest that even if the immediate issue ends up being sorted > out, just from a robustness standpoint the "consider EBUSY a hard > error" seems to be a mistake. Especially from umount. The point I was trying to make in the other thread is that this needs fixing in the subsystem that's causing _unnecessary_ spurious EBUSY errors, and Jens has been at it right away. What we don't want is for successful umount to be equated with a guarantee that an immediate mount can never return EBUSY again. I don't think that's a guarantee umount should give; with mount namespaces in the mix, to give just one very obvious example, your filesystem can get pinned implicitly somewhere behind your back without you ever noticing. > > Transient failures are pretty much expected Yes, I agree.
On Thu 10-08-23 22:47:03, Kent Overstreet wrote: > On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote: > > On Thu 10-08-23 11:54:53, Kent Overstreet wrote: > > > > And there clearly is something very strange going on with superblock > > > > handling > > > > > > This deserves an explanation because sget() is a bit nutty. > > > > > > The way sget() is conventionally used for block device filesystems, the > > > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used, > > > but the holder is the fs type pointer, so it won't exclude with other > > > opens of the same fs type. > > > > > > That means the only protection from multiple opens scribbling over each > > > other is sget() itself - but if the bdev handle ever outlives the > > > superblock we're completely screwed; that's a silent data corruption bug > > > that we can't easily catch, and if the filesystem teardown path has any > > > asynchronous stuff going on (and of course it does) that's not a hard > > > mistake to make. I've observed at least one bug that looked suspiciously > > > like that, but I don't think I quite pinned it down at the time. > > > > This is just being changed - check Christian's VFS tree. There are patches > > that make sget() use superblock pointer as a bdev holder so the reuse > > you're speaking about isn't a problem anymore. > > So then the question is what do you use for identifying the superblock, > and you're switching to the dev_t - interesting. > > Are we 100% sure that will never break, that a dev_t will always > identify a unique block_device? Namespacing has been changing things. Yes, dev_t is a unique identifier of the device, we rely on that in multiple places, block device open comes to mind as the first. You're right namespacing changes things but we implement that as changing what gets presented to userspace via some mapping layer while the kernel keeps using globally unique identifiers. Honza
On Fri, Aug 11, 2023 at 10:10:42AM +0200, Jan Kara wrote: > On Thu 10-08-23 22:47:03, Kent Overstreet wrote: > > On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote: > > > On Thu 10-08-23 11:54:53, Kent Overstreet wrote: > > > > > And there clearly is something very strange going on with superblock > > > > > handling > > > > > > > > This deserves an explanation because sget() is a bit nutty. > > > > > > > > The way sget() is conventionally used for block device filesystems, the > > > > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used, > > > > but the holder is the fs type pointer, so it won't exclude with other > > > > opens of the same fs type. > > > > > > > > That means the only protection from multiple opens scribbling over each > > > > other is sget() itself - but if the bdev handle ever outlives the > > > > superblock we're completely screwed; that's a silent data corruption bug > > > > that we can't easily catch, and if the filesystem teardown path has any > > > > asynchronous stuff going on (and of course it does) that's not a hard > > > > mistake to make. I've observed at least one bug that looked suspiciously > > > > like that, but I don't think I quite pinned it down at the time. > > > > > > This is just being changed - check Christian's VFS tree. There are patches > > > that make sget() use superblock pointer as a bdev holder so the reuse > > > you're speaking about isn't a problem anymore. > > > > So then the question is what do you use for identifying the superblock, > > and you're switching to the dev_t - interesting. > > > > Are we 100% sure that will never break, that a dev_t will always > > identify a unique block_device? Namespacing has been changing things. > > Yes, dev_t is a unique identifier of the device, we rely on that in > multiple places, block device open comes to mind as the first. You're > right namespacing changes things but we implement that as changing what > gets presented to userspace via some mapping layer while the kernel keeps > using globally unique identifiers. Full device namespacing is not on the horizon at all. We looked into this years ago and it would be a giant effort that would affect nearly everything if done properly. So even if we did it, there would be so many changes required that reliance on dev_t in the VFS would be the least of our problems.
> I don't want to do that to Christian either, I think highly of the work > he's been doing and I don't want to be adding to his frustration. So I > apologize for loosing my cool earlier; a lot of that was frustration > from other threads spilling over. > > But: if he's going to be raising objections, I need to know what his > concerns are if we're going to get anywhere. Raising objections without > saying what the concerns are shuts down discussion; I don't think it's > unreasonable to ask people not to do that, and to try and stay focused > on the code. The technical aspects were made clear off-list and I believe multiple times on-list by now. Any VFS and block related patches are to be reviewed and accepted before bcachefs gets merged. This was also clarified off-list before the pull request was sent. Yet, it was sent anyway. On the receiving end this feels disrespectful. To other maintainers this implies you only accept Linus verdict and expect him to ignore objections of other maintainers and pull it all in. That would've caused massive amounts of frustration and conflict should that have happened. So this whole pull request had massive potential to divide the community. And in the end you were told the same requirements that we did have and then you accepted it but that cannot be the only barrier that you accept. And it's not just all about code. Especially from a maintainer's perspective. There's two lengthy mails from Darrick and from you with detailed excursions about social aspects as well. Social aspects in fact often come into the center whenever we focus on code. There will be changes that a sub-maintainer may think are absolutely required and that infrastructure maintainers will reject for reasons that the sub-maintainer might fundamentally disagree with and we need to be confident that a maintainer can handle this gracefully and respectfully. If there's strong indication to the contrary it's a problem that can't be ignored. To address this issue I did request at LSFMM that I want a co-maintainer for bcachefs that can act as a counterweight and balancing factor. Not just a reviewer but someone who is designated to make decisions in addition to you and can step in. That would be my preferred thing. Timeline wise, my preference would be if we could get the time to finish the super work that Christoph and Jan are currently doing and have a cycle to see how badly the world breaks. And then we aim to merge bcachefs for v6.7 in November. That's really not far away and also gives everyone the time to calm down a little.
On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote: > The technical aspects were made clear off-list and I believe multiple > times on-list by now. Any VFS and block related patches are to be > reviewed and accepted before bcachefs gets merged. Christian, you're misrepresenting. The fact is, the _very same person_ who has been most vocal in saying "all patches need to go in through maintainers first" was also in years past one of the people saying that patches only for bcachefs shouldn't go in until the bcachefs pull. And as well, we also had Linus just looking at the prereq series and saying acks would be fine from Jens. > This was also clarified off-list before the pull request was sent. Yet, > it was sent anyway. All these patches have hit the list multiple times; the one VFS patch in question is a tiny new helper and it's been in your inbox. > On the receiving end this feels disrespectful. To other maintainers this > implies you only accept Linus verdict and expect him to ignore > objections of other maintainers and pull it all in. Well, it is his kernel :) And more than that, I find Linus genuinely more pleasant to deal with; I always feel like I'm talking to someone who's just trying to have an intelligent conversation and doesn't want to waste time on bullshit. Look, in the private pre-pull request thread, within _hours_ he was tearing into six locks and the statistics code. I post that same code to the locking mailing list, and I got - what, a couple comments to clarify? A spelling mistake pointed out? So yeah, I appreciate hearing from him. The code's been out on the mailing list for months and you haven't commented at all. All I need from you is an ack on the dcache helper or a comment saying why you don't like it, and all I'm getting is complaints. > That would've caused massive amounts of frustration and conflict > should that have happened. So this whole pull request had massive > potential to divide the community. Christian, I've been repeatedly asking what your concerns are: we had _two_ meetings set up for you that you noshow'd on. And here you are continuing to make wild claims about frustration and conflict, but you can't seem to name anything specific. I don't want to make your life more difficult, but you seem to want to make _mine_ more difficult. You made one offhand comment about not wanting a repeat of ntfs3, and when I asked you for details you never even responded. > Timeline wise, my preference would be if we could get the time to finish > the super work that Christoph and Jan are currently doing and have a > cycle to see how badly the world breaks. And then we aim to merge > bcachefs for v6.7 in November. That's really not far away and also gives > everyone the time to calm down a little. I don't see the justification for the delay - every cycle there's some amount of vfs/block layer refactoring that affects filesystems, the super work is no different.
On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote: > > I don't want to do that to Christian either, I think highly of the work > > he's been doing and I don't want to be adding to his frustration. So I > > apologize for loosing my cool earlier; a lot of that was frustration > > from other threads spilling over. > > > > But: if he's going to be raising objections, I need to know what his > > concerns are if we're going to get anywhere. Raising objections without > > saying what the concerns are shuts down discussion; I don't think it's > > unreasonable to ask people not to do that, and to try and stay focused > > on the code. > > The technical aspects were made clear off-list and I believe multiple > times on-list by now. Any VFS and block related patches are to be > reviewed and accepted before bcachefs gets merged. Here's the one VFS patch in the series - could we at least get an ack for this? It's a new helper, just breaks the existing d_tmpfile() up into two functions - I hope we can at least agree that this patch shouldn't be controversial? -->-- Subject: [PATCH] fs: factor out d_mark_tmpfile() New helper for bcachefs - bcachefs doesn't want the inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on its own atomically with other btree updates Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org diff --git a/fs/dcache.c b/fs/dcache.c index 52e6d5fdab..dbdafa2617 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent) EXPORT_SYMBOL(d_genocide); -void d_tmpfile(struct file *file, struct inode *inode) +void d_mark_tmpfile(struct file *file, struct inode *inode) { struct dentry *dentry = file->f_path.dentry; - inode_dec_link_count(inode); BUG_ON(dentry->d_name.name != dentry->d_iname || !hlist_unhashed(&dentry->d_u.d_alias) || !d_unlinked(dentry)); @@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode) (unsigned long long)inode->i_ino); spin_unlock(&dentry->d_lock); spin_unlock(&dentry->d_parent->d_lock); +} +EXPORT_SYMBOL(d_mark_tmpfile); + +void d_tmpfile(struct file *file, struct inode *inode) +{ + struct dentry *dentry = file->f_path.dentry; + + inode_dec_link_count(inode); + d_mark_tmpfile(file, inode); d_instantiate(dentry, inode); } EXPORT_SYMBOL(d_tmpfile); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index 6b351e009f..3da2f0545d 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *); /* <clickety>-<click> the ramfs-type tree */ extern void d_genocide(struct dentry *); +extern void d_mark_tmpfile(struct file *, struct inode *); extern void d_tmpfile(struct file *, struct inode *); extern struct dentry *d_find_alias(struct inode *);
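As a hedged illustration of the intended use (invented function name, not taken from the bcachefs tree): a filesystem that already accounts i_nlink itself, atomically with its own metadata updates, would skip the inode_dec_link_count() that d_tmpfile() performs and call the new helper directly:

static void example_fs_finish_tmpfile(struct file *file, struct inode *inode)
{
        /* i_nlink is already 0, maintained by the filesystem itself */
        d_mark_tmpfile(file, inode);
        d_instantiate(file->f_path.dentry, inode);
}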
On 8/10/23 11:53 PM, Linus Torvalds wrote: > On Thu, 10 Aug 2023 at 22:29, Kent Overstreet <kent.overstreet@linux.dev> wrote: >> >> On Thu, Aug 10, 2023 at 10:20:22PM -0700, Linus Torvalds wrote: >>> If it's purely "umount doesnt' succeed because the filesystem is still >>> busy with cleanups", then things are much better. >> >> That's exactly it. We have various tests that kill -9 fio and then >> umount, and umount spuriously fails. > > Well, it sounds like Jens already has some handle on at least one > io_uring shutdown case that didn't wait for completion. > > At the same time, a random -EBUSY is kind of an expected failure in > real life, since outside of strictly controlled environments you could > easily have just some entirely unrelated thing that just happens to > have looked at the filesystem when you tried to unmount it. > > So any real-life use tends to use umount in a (limited) loop. It might > just make sense for the fsstress test scripts to do the same > regardless. > > There's no actual good reason to think that -EBUSY is a hard error. It > very much can be transient. Indeed, any production kind of workload would have some kind of graceful handling for that. That doesn't mean we should not fix the delayed fput to avoid it if we can, just that it might make sense to have an xfstest helper that at least tries X times with a sync in between or something like that.
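For illustration, a retry loop along the lines Jens suggests might look like the following minimal sketch. This is a hypothetical standalone helper, not an existing xfstests program; the retry count and delay are arbitrary assumptions.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

/* Hypothetical helper: retry umount(2) on transient -EBUSY, syncing between attempts. */
int main(int argc, char **argv)
{
	int tries = 10;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	while (tries--) {
		if (umount(argv[1]) == 0)
			return 0;

		if (errno != EBUSY) {
			perror("umount");
			return 1;
		}

		/* Treat -EBUSY as possibly transient: flush and retry. */
		sync();
		usleep(100000);
	}

	fprintf(stderr, "umount %s: still busy after retries\n", argv[1]);
	return 1;
}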
On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote: > On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote: > > > I don't want to do that to Christian either, I think highly of the work > > > he's been doing and I don't want to be adding to his frustration. So I > > > apologize for loosing my cool earlier; a lot of that was frustration > > > from other threads spilling over. > > > > > > But: if he's going to be raising objections, I need to know what his > > > concerns are if we're going to get anywhere. Raising objections without > > > saying what the concerns are shuts down discussion; I don't think it's > > > unreasonable to ask people not to do that, and to try and stay focused > > > on the code. > > > > The technical aspects were made clear off-list and I believe multiple > > times on-list by now. Any VFS and block related patches are to be > > reviewed and accepted before bcachefs gets merged. > > Here's the one VFS patch in the series - could we at least get an ack > for this? It's a new helper, just breaks the existing d_tmpfile() up > into two functions - I hope we can at least agree that this patch > shouldn't be controversial? > > -->-- > Subject: [PATCH] fs: factor out d_mark_tmpfile() > > New helper for bcachefs - bcachefs doesn't want the > inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on > its own atomically with other btree updates > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: Christian Brauner <brauner@kernel.org> > Cc: linux-fsdevel@vger.kernel.org Yes, we can finally clean up this braindamage in xfs_generic_create: if (tmpfile) { /* * The VFS requires that any inode fed to d_tmpfile must * have nlink == 1 so that it can decrement the nlink in * d_tmpfile. However, we created the temp file with * nlink == 0 because we're not allowed to put an inode * with nlink > 0 on the unlinked list. Therefore we * have to set nlink to 1 so that d_tmpfile can * immediately set it back to zero. */ set_nlink(inode, 1); d_tmpfile(tmpfile, inode); } Reviewed-by: Darrick J. 
Wong <djwong@kernel.org> --D > > diff --git a/fs/dcache.c b/fs/dcache.c > index 52e6d5fdab..dbdafa2617 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent) > > EXPORT_SYMBOL(d_genocide); > > -void d_tmpfile(struct file *file, struct inode *inode) > +void d_mark_tmpfile(struct file *file, struct inode *inode) > { > struct dentry *dentry = file->f_path.dentry; > > - inode_dec_link_count(inode); > BUG_ON(dentry->d_name.name != dentry->d_iname || > !hlist_unhashed(&dentry->d_u.d_alias) || > !d_unlinked(dentry)); > @@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode) > (unsigned long long)inode->i_ino); > spin_unlock(&dentry->d_lock); > spin_unlock(&dentry->d_parent->d_lock); > +} > +EXPORT_SYMBOL(d_mark_tmpfile); > + > +void d_tmpfile(struct file *file, struct inode *inode) > +{ > + struct dentry *dentry = file->f_path.dentry; > + > + inode_dec_link_count(inode); > + d_mark_tmpfile(file, inode); > d_instantiate(dentry, inode); > } > EXPORT_SYMBOL(d_tmpfile); > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index 6b351e009f..3da2f0545d 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *); > /* <clickety>-<click> the ramfs-type tree */ > extern void d_genocide(struct dentry *); > > +extern void d_mark_tmpfile(struct file *, struct inode *); > extern void d_tmpfile(struct file *, struct inode *); > > extern struct dentry *d_find_alias(struct inode *);
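To make the usage concrete: with d_mark_tmpfile() split out, a filesystem that manages i_nlink itself no longer needs the set_nlink(inode, 1) dance quoted above. The sketch below is purely illustrative - the example_* function names are invented, not actual xfs or bcachefs code - and assumes the inode is allocated with nlink == 0 and that the link count is handled by the filesystem's own transaction machinery.

/* Hypothetical ->tmpfile implementation using the new helper. */
static int example_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
			   struct file *file, umode_t mode)
{
	struct inode *inode;

	/* Filesystem-specific allocation; inode comes back with nlink == 0. */
	inode = example_new_inode(dir, mode);
	if (IS_ERR(inode))
		return PTR_ERR(inode);

	/*
	 * No temporary set_nlink(inode, 1) needed: d_mark_tmpfile() only
	 * marks the dentry, it doesn't touch the link count.
	 */
	d_mark_tmpfile(file, inode);
	d_instantiate(file->f_path.dentry, inode);

	return finish_open_simple(file, 0);
}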
On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote: > On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote: > > > I don't want to do that to Christian either, I think highly of the work > > > he's been doing and I don't want to be adding to his frustration. So I > > > apologize for loosing my cool earlier; a lot of that was frustration > > > from other threads spilling over. > > > > > > But: if he's going to be raising objections, I need to know what his > > > concerns are if we're going to get anywhere. Raising objections without > > > saying what the concerns are shuts down discussion; I don't think it's > > > unreasonable to ask people not to do that, and to try and stay focused > > > on the code. > > > > The technical aspects were made clear off-list and I believe multiple > > times on-list by now. Any VFS and block related patches are to be > > reviewed and accepted before bcachefs gets merged. > > Here's the one VFS patch in the series - could we at least get an ack > for this? It's a new helper, just breaks the existing d_tmpfile() up > into two functions - I hope we can at least agree that this patch > shouldn't be controversial? > > -->-- > Subject: [PATCH] fs: factor out d_mark_tmpfile() > > New helper for bcachefs - bcachefs doesn't want the > inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on > its own atomically with other btree updates > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: Christian Brauner <brauner@kernel.org> > Cc: linux-fsdevel@vger.kernel.org Yep, that looks good, Reviewed-by: Christian Brauner <brauner@kernel.org>
On Fri, Aug 11, 2023 at 08:58:01AM -0400, Kent Overstreet wrote: > I don't see the justification for the delay - every cycle there's some > amount of vfs/block layer refactoring that affects filesystems, the > super work is no different. So, the reason is that we're very close to having the super code massaged in a shape where bcachefs should be able to directly make use of the helpers instead of having to pull in custom code at all. But not all that work has made it.
On Mon, Aug 14, 2023 at 09:25:54AM +0200, Christian Brauner wrote: > On Fri, Aug 11, 2023 at 08:58:01AM -0400, Kent Overstreet wrote: > > I don't see the justification for the delay - every cycle there's some > > amount of vfs/block layer refactoring that affects filesystems, the > > super work is no different. > > So, the reason is that we're very close to having the super code > massaged in a shape where bcachefs should be able to directly make use > of the helpers instead of having to pull in custom code at all. But not > all that work has made it. Well, bcachefs really isn't doing anything terribly unusual here; we're using sget() directly, same as btrfs, and we have to because we're both multi-device filesystems. Jan's restructuring of mount_bdev() got me thinking that it should be possible to do a mount_bdevs() that both btrfs and bcachefs could use - but we don't need to be blocked on that, sget()'s been a normal exported interface since forever. Somewhat related, I dropped this patch from my tree: block: Don't block on s_umount from __invalidate_super() https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-v6.3&id=1dd488901bc025a61e1ce1a0f54999a2b221bd78 and instead, for now we're closing block devices later in the shutdown path like other filesystems do (after calling generic_shutdown_super(), not in put_super()). But now I've got some test failures, e.g. https://evilpiepirate.org/~testdashboard/c/040e910f7f316ea6273c895dcc026b9f1ad36a8e/xfstests.generic.604/log.br and since you guys are switching block device opens to use a real holder, I suspect you'll be seeing the same issue soon. The bug is that the mount appears to be gone - generic_shutdown_super() is finished - so as far as userspace can tell everything is shut down and we should be able to start using the block device again, but the unmount path hasn't actually called blkdev_put() yet. So that patch I posted is one way to solve the self-deadlock from calling blkdev_put() where we really want to be calling it... not the prettiest way, but I think this is something we do need to get fixed.
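As a rough sketch of the ordering being described - hypothetical names, not the actual bcachefs code - a ->kill_sb() that releases the block devices only after generic_shutdown_super() avoids calling blkdev_put() from ->put_super(), but leaves exactly the window Kent mentions, where userspace already sees the unmount as complete while the devices are still held:

static void example_kill_sb(struct super_block *sb)
{
	struct example_fs *fs = sb->s_fs_info;

	/* Tear down the VFS side first; the caller holds s_umount. */
	generic_shutdown_super(sb);

	/*
	 * Only now close the underlying block devices. Userspace that was
	 * waiting for the unmount to finish may already be trying to reopen
	 * them, which is the failure mode Kent describes above.
	 */
	example_put_devices(fs);
	kfree(fs);
}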
On Mon, Aug 14, 2023 at 09:21:22AM +0200, Christian Brauner wrote: > On Fri, Aug 11, 2023 at 09:21:41AM -0400, Kent Overstreet wrote: > > On Fri, Aug 11, 2023 at 12:54:42PM +0200, Christian Brauner wrote: > > > > I don't want to do that to Christian either, I think highly of the work > > > > he's been doing and I don't want to be adding to his frustration. So I > > > > apologize for loosing my cool earlier; a lot of that was frustration > > > > from other threads spilling over. > > > > > > > > But: if he's going to be raising objections, I need to know what his > > > > concerns are if we're going to get anywhere. Raising objections without > > > > saying what the concerns are shuts down discussion; I don't think it's > > > > unreasonable to ask people not to do that, and to try and stay focused > > > > on the code. > > > > > > The technical aspects were made clear off-list and I believe multiple > > > times on-list by now. Any VFS and block related patches are to be > > > reviewed and accepted before bcachefs gets merged. > > > > Here's the one VFS patch in the series - could we at least get an ack > > for this? It's a new helper, just breaks the existing d_tmpfile() up > > into two functions - I hope we can at least agree that this patch > > shouldn't be controversial? > > > > -->-- > > Subject: [PATCH] fs: factor out d_mark_tmpfile() > > > > New helper for bcachefs - bcachefs doesn't want the > > inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on > > its own atomically with other btree updates > > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > > Cc: Christian Brauner <brauner@kernel.org> > > Cc: linux-fsdevel@vger.kernel.org > > Yep, that looks good, > Reviewed-by: Christian Brauner <brauner@kernel.org> Thanks, much appreciated
[Been on PTO this last week and a half] On Thu, Aug 10, 2023 at 11:45:26PM -0400, Kent Overstreet wrote: > On Thu, Aug 10, 2023 at 03:39:42PM -0700, Darrick J. Wong wrote: > > Nowadays we have all these people running bots and AIs throwing a steady > > stream of bug reports and CVE reports at Dave [Chinner] and I. Most of > > these people *do not* help fix the problems they report. Once in a > > while there's an actual *user* report about data loss, but those > > (thankfully) aren't the majority of the reports. > > > > However, every one of these reports has to be triaged, analyzed, and > > dealt with. As soon as we clear one, at least one more rolls in. You > > know what that means? Dave and I are both in a permanent state of > > heightened alert, fear, and stress. We never get to settle back down to > > calm. Every time someone brings up syzbot, CVEs, or security? I feel > > my own stress response ramping up. I can no longer have "rational" > > conversations about syzbot because those discussions push my buttons. > > > > This is not healthy! > > Yeah, we really need to take a step back and ask ourselves what we're > trying to do here. > > At this point, I'm not so sure hardening xfs/ext4 in all the ways people > are wanting them to be hardened is a realistic idea: these are huge, old > C codebases that are tricky to work on, and they weren't designed from > the start with these kinds of considerations. Yes, in a perfect world > all code should be secure and all bugs should be fixed, but is this the > way to do it? Look at it this way: For XFS we've already done the hardening work - we started that way back in 2008 when we started planning for the V5 filesystem format to avoid all the random bit failures that were occurring out there in the real world and from academic fuzzer research. The problem with syzbot has been that it has been testing the old V4 format, and it keeps tripping over different symptoms of the same problems that the v5 format either isn't susceptible to or that it detects, fixing them or shutting down the filesystem. Since syzbot finally turned off v4 format testing on the 3rd July, we haven't had a single new syzbot bug report on XFS. I don't expect syzbot to find a significant number of new issues on XFS from this point onwards... So, yeah, I think we did the bulk of the possible format verification/hardening work in XFS a decade ago, and the stream of bugs we've been seeing is due to intentionally ignoring the format that actually provides some defences against random bit manipulation based failures... > Personally, I think we'd be better served by putting what manpower we > can spare into starting on an incremental Rust rewrite; at least that's > my plan for bcachefs, and something I've been studying for awhile (as > soon as the gcc rust stuff lands I'll be adding Rust code to > fs/bcachefs, some code already exists). For xfs/ext4, teasing things > apart and figuring out how to restructure data structures in a way to > pass the borrow checker may not be realistic, I don't know the codebases > well enough to say - but clearly the current approach is not working, > and these codebases are almost definitely still going to be in use 50 > years from now, we need to be coming up with _some_ sort of plan. For XFS, my plan for the past couple of years has been to start with rewriting chunks of the userspace code in rust.
That shares the core libxfs code with the kernel, so the idea is that we slowly reimplement bits of libxfs in userspace in rust where we have the freedom to just rip and tear the code apart. Then when we have something that largely works we can pull that core libxfs rust code back into the kernel as rust support improves. Of course, that's been largely put on the back burner over the past year or so because of all the other demands on my time that stuff like dealing with 1-2 syzbot bug reports a week has resulted in.... > And if we had a coherent long term plan, maybe that would help with the > funding and headcount issues... I don't think a lack of a plan is the problem with funding and headcount. At its core, the problem is inherent in the capitalism model that is "funding" the "community" - squeeze the most you can from as little as possible and externalise the costs as much as possible. Burning people out is essentially externalising the human cost of corporate bean counter optimisation of the bottom line... If I had a dollar for every time I'd been told "we don't have the money for more resources" whilst both company revenue and profits are continually going up, we could pay for another engineer... [....] > > A broader dynamic here is that I ask people to review the code so that I > > can merge it; they say they will do it; and then an entire cycle goes by > > without any visible progress. > > > > When I ask these people why they didn't follow through on their > > commitments, the responses I hear are pretty uniform -- they got buried > > in root cause analysis of a real bug report but lol there were no other > > senior people available; their time ended up being spent on backports or > > arguing about backports; or they got caught up in that whole freakout > > thing I described above. > > Yeah, that set of priorities makes sense when we're talking about > patches that modify existing code; if you can't keep up with bug reports > then you have to slow down on changes, and changes to existing code > often do need the meticulous review - and hopefully while people are > waiting on code review they'll be helping out with bug reports. > > But for new code that isn't going to upset existing users, if we trust > the author to not do crazy things then code review is really more about > making sure someone else understands the code. But if they're putting in > all the proper effort to document, to organize things well, to do things > responsibly, does it make sense for that level of code review to be an > up front requirement? Perhaps we could think a _bit_ more about how we > enable people to do good work. That's pretty much the review rationale I'm using for the online fsck code. I really only care in detail about how it interfaces with the core XFS infrastructure, and as long as the rest of it makes sense and doesn't make my eyes bleed then it's good enough. That doesn't change the fact that it takes me at least a week to read through 10,000 lines of code with sufficient rigour to form an opinion on it, and that's before I know what I need to look at in more detail. So however you look at it, even a "good enough" review of 50,000 lines of new code (the size of online fsck) still requires a couple of months of review time for someone who knows the subsystem intimately... > I'm sure the XFS people have thought about this more than I have, but > given how long this has been taking you and the amount of pushback I > feel it ought to be asked.
Certainly we have, and for a long time the merge criteria for code that is tagged as EXPERIMENTAL have been lower (i.e. good enough), and that's how we merged things like the v5 format, reflink support, rmap support, etc. without huge amounts of review friction. The problem is that the drive over the past few years for more intense review - fuelled by CI and bot-driven testing with no-regression policies - has really pushed common sense out the window. These days it feels like we're only allowed to merge "perfect" code, otherwise the code is "insecure" (e.g. the policies being advocated by syzbot developers). Hence review over the past few years got more finicky and picky because of the fear that regressions will be introduced with new code. This is a direct result of it being drummed into developers that regressions and CI failures must be avoided at -all costs-. I.e. the policies and testing infrastructure being used to "validate" these large software projects are pushing us hard towards the "code must be perfect at first attempt" side of the coin rather than the more practical (and achievable) "good enough" bar. CI is useful and good for code quality, but common sense has to prevail at some point.... -Dave.