Message ID | cover.1715616501.git.dsterba@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | [GIT,PULL] Btrfs updates for 6.10 |
The pull request you sent on Mon, 13 May 2024 18:20:55 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.10-tag
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/a3d1f54d7aa4c3be2c6a10768d4ffa1dcb620da9
Thank you!
On Mon, 13 May 2024 at 09:28, David Sterba <dsterba@suse.com> wrote:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.10-tag

So I initially blamed a GPU driver for the following problem, but Dave
Airlie seems to think it's unlikely that problem would cause this kind
of corruption, so now it looks like it might just be btrfs itself:

  BUG: Bad page state in process kworker/u261:13 pfn:31fb9a
  page: refcount:0 mapcount:0 mapping:00000000ff0b239e index:0x37ce8 pfn:0x31fb9a
  aops:btree_aops ino:1
  flags: 0x2fffc600000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fff)
  page_type: 0xffffffff()
  raw: 02fffc600000020c dead000000000100 dead000000000122 ffff9b191efb0338
  raw: 0000000000037ce8 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: non-NULL mapping
  CPU: 18 PID: 141351 Comm: kworker/u261:13 Tainted: G W 6.9.0-07381-g3860ca371740 #60
  Workqueue: btrfs-delayed-meta btrfs_work_helper
  Call Trace:
   bad_page+0xe0/0xf0
   free_unref_page_prepare+0x363/0x380
   ? __count_memcg_events+0x63/0xd0
   free_unref_page+0x33/0x1f0
   ? __mem_cgroup_uncharge+0x80/0xb0
   __folio_put+0x62/0x80
   release_extent_buffer+0xad/0x110
   btrfs_force_cow_block+0x68f/0x890
   btrfs_cow_block+0xe5/0x240
   btrfs_search_slot+0x30e/0x9f0
   btrfs_lookup_inode+0x31/0xb0
   __btrfs_update_delayed_inode+0x5c/0x350
   ? kfree+0x80/0x250
   __btrfs_commit_inode_delayed_items+0x7a1/0x7d0
   btrfs_async_run_delayed_root+0xf7/0x1b0
   btrfs_work_helper+0xc0/0x320
   process_scheduled_works+0x196/0x360
   worker_thread+0x2b8/0x370
   ? pr_cont_work+0x190/0x190
   kthread+0x111/0x120
   ? kthread_blkcg+0x30/0x30
   ret_from_fork+0x30/0x40
   ? kthread_blkcg+0x30/0x30
   ret_from_fork_asm+0x11/0x20

Note the line

  page dumped because: non-NULL mapping

but the actual mapping pointer isn't a valid kernel pointer. I suspect
that may be due to pointer hashing, though. I'm not convinced that's a
great idea for this case, but hey, here we are. Sometimes those "don't
leak kernel pointers" things cause problems for debugging.
Anyway, it looks like the btrfs_cow_block -> btrfs_force_cow_block ->
release_extent_buffer -> __folio_put path might be releasing a page
that is still attached to a mapping. Perhaps some page counting
imbalance?

This all happened under fairly normal - for me - workstation loads. I
was (of course) doing an allmodconfig kernel build after a pull, and I
had a handful of terminals and the web browser open. Nothing
particularly interesting or odd.

Does the above make any btrfs people go "Ahh, I see how that would be
a problem"?

Linus
On 2024/5/16 10:01, Linus Torvalds wrote:
> [...]
>
> This all happened under fairly normal - for me - workstation loads. I
> was (of course) doing an allmodconfig kernel build after a pull, and I
> had a handful of terminals and the web browser open. Nothing
> particularly interesting or odd.

Considering aarch64 is going more and more common, is the workstation
also an aarch64 platform? (the Ampere one?)

If so, would you mind sharing the page size and the fs sectorsize?
That would at least help us know whether it's the subpage routine or
the regular routine.

Thanks,
Qu
On Thu, 16 May 2024 at 02:02, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> Considering aarch64 is going more and more common, is the workstation
> also an aarch64 platform? (the Ampere one?)

No, this happened on my regular old AMD Threadripper.

Linus