Message ID | 5c6c1cefa7df5fbe9c4dc2fe517f521760a2f4be.1674492773.git.fdmanana@suse.com |
---|---|
State | New, archived |
Series | btrfs: lock the inode in shared mode before starting fiemap |
On Mon, Jan 23, 2023 at 04:54:46PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently fiemap does not take the inode's lock (VFS lock), it only locks
> a file range in the inode's io tree. This however can lead to a deadlock
> if we have a concurrent fsync on the file and fiemap code triggers a fault
> when accessing the user space buffer with fiemap_fill_next_extent(). The
> deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
> by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
> syzbot and triggers a trace like the following:
>
> task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004
> Call Trace:
> <TASK>
> context_switch kernel/sched/core.c:5293 [inline]
> __schedule+0x995/0xe20 kernel/sched/core.c:6606
> schedule+0xcb/0x190 kernel/sched/core.c:6682
> wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
> wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
> lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
> find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
> writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
> __extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
> extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
> extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
> do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
> filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
> __filemap_fdatawrite_range mm/filemap.c:421 [inline]
> filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
> btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
> start_ordered_ops fs/btrfs/file.c:1737 [inline]
> btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
> generic_write_sync include/linux/fs.h:2885 [inline]
> btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
> call_write_iter include/linux/fs.h:2189 [inline]
> new_sync_write fs/read_write.c:491 [inline]
> vfs_write+0x7dc/0xc50 fs/read_write.c:584
> ksys_write+0x177/0x2a0 fs/read_write.c:637
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f7d4054e9b9
> RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
> RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
> RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
> R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
> </TASK>
> INFO: task syz-executor361:5697 blocked for more than 145 seconds.
> Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004
> Call Trace:
> <TASK>
> context_switch kernel/sched/core.c:5293 [inline]
> __schedule+0x995/0xe20 kernel/sched/core.c:6606
> schedule+0xcb/0x190 kernel/sched/core.c:6682
> rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
> __down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
> btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
> do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
> wp_page_shared+0x15e/0x380 mm/memory.c:3295
> handle_pte_fault mm/memory.c:4949 [inline]
> __handle_mm_fault mm/memory.c:5073 [inline]
> handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
> do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
> handle_page_fault arch/x86/mm/fault.c:1519 [inline]
> exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
> asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
> RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
> Code: 74 0a 89 (...)
> RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
> RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
> RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
> RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
> R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
> R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
> copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
> raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
> _copy_to_user+0xe9/0x130 lib/usercopy.c:34
> copy_to_user include/linux/uaccess.h:169 [inline]
> fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
> emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
> fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
> extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
> btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
> ioctl_fiemap fs/ioctl.c:219 [inline]
> do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
> __do_sys_ioctl fs/ioctl.c:868 [inline]
> __se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f7d4054e9b9
> RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
> RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
> RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
> R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
> R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
> </TASK>
>
> What happens is the following:
>
> 1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
>    before locking the inode and the i_mmap_lock semaphore, that is, before
>    calling btrfs_inode_lock();
>
> 2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
>    another task dirties a page;
>
> 3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
>    at step 2 remains dirty and unflushed. Then it enters extent_fiemap()
>    and locks a file range that includes the range of the page dirtied
>    in step 2;
>
> 4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
>    inode's i_mmap_lock semaphore in write mode. Then it tries to flush
>    delalloc by calling start_ordered_ops(), which will block, at
>    find_lock_delalloc_range(), when trying to lock the range of the page
>    dirtied at step 2, since this range was locked by the fiemap task (at
>    step 3);
>
> 5) Task B generates a page fault when accessing the user space fiemap
>    buffer with a call to fiemap_fill_next_extent().
>
>    The fault handler needs to call btrfs_page_mkwrite() for some other
>    page of our inode, and there we deadlock when trying to lock the
>    inode's i_mmap_lock semaphore in read mode, since the fsync task locked
>    it in write mode (step 4) and the fsync task can not progress because
>    it's waiting to lock a file range that is currently locked by us (the
>    fiemap task, step 3).
>
> Fix this by taking the inode's lock (VFS lock) in shared mode when
> entering fiemap. This effectively serializes fiemap with fsync (except the
> most expensive part of fsync, the log sync), preventing this deadlock.
>
> Reported-by: syzbot+cc35f55c41e34c30dcb5@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/linux-btrfs/00000000000032dc7305f2a66f46@google.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
On Mon, Jan 23, 2023 at 04:54:46PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>

> Fix this by taking the inode's lock (VFS lock) in shared mode when
> entering fiemap. This effectively serializes fiemap with fsync (except the
> most expensive part of fsync, the log sync), preventing this deadlock.

Could this be a problem, when a continuous fiemap would block fsync on a
file? Fsync is more important and I'd give it priority when the inode
lock is contended; fiemap is only informative and best effort. The
deadlock needs to be fixed of course, but some mechanism should be in
place to favor lock holders other than fiemap. A quick idea is to relock
after each, say, 100 extents.
On Tue, Feb 7, 2023 at 3:46 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Jan 23, 2023 at 04:54:46PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
>
> > Fix this by taking the inode's lock (VFS lock) in shared mode when
> > entering fiemap. This effectively serializes fiemap with fsync (except the
> > most expensive part of fsync, the log sync), preventing this deadlock.
>
> Could this be a problem, when a continuous fiemap would block fsync on a
> file? Fsync is more important and I'd give it priority when the inode
> lock is contended; fiemap is only informative and best effort. The
> deadlock needs to be fixed of course, but some mechanism should be in
> place to favor lock holders other than fiemap. A quick idea is to relock
> after each, say, 100 extents.

Typically fiemap is called with a small buffer for extent entries. cp
and filefrag, for example, use a small buffer, so they repeatedly call
fiemap with such a small buffer (not more than 100 or 200 entries).

Do you think it's that common to have fiemap and fsync attempts in
parallel?

I don't think it's a problem: not only should it not be common, but if
it happens it's for a very short time, even for a very long fiemap
buffer.
On Tue, Feb 07, 2023 at 04:15:00PM +0000, Filipe Manana wrote:
> On Tue, Feb 7, 2023 at 3:46 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, Jan 23, 2023 at 04:54:46PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> >
> > > Fix this by taking the inode's lock (VFS lock) in shared mode when
> > > entering fiemap. This effectively serializes fiemap with fsync (except the
> > > most expensive part of fsync, the log sync), preventing this deadlock.
> >
> > Could this be a problem, when a continuous fiemap would block fsync on a
> > file? Fsync is more important and I'd give it priority when the inode
> > lock is contended; fiemap is only informative and best effort. The
> > deadlock needs to be fixed of course, but some mechanism should be in
> > place to favor lock holders other than fiemap. A quick idea is to relock
> > after each, say, 100 extents.
>
> Typically fiemap is called with a small buffer for extent entries. cp
> and filefrag, for example, use a small buffer, so they repeatedly call
> fiemap with such a small buffer (not more than 100 or 200 entries).

I see, so the buffer size can limit that naturally.

> Do you think it's that common to have fiemap and fsync attempts in
> parallel?
>
> I don't think it's a problem: not only should it not be common, but if
> it happens it's for a very short time, even for a very long fiemap
> buffer.

Yeah, it's not supposed to be common; I was thinking about the corner
case. The worst case is when several parallel fiemaps are running that
don't relinquish the inode lock and another thread writes+fsyncs the
file and stalls. Imagine a use case like a monitoring tool for VM images
that watches used space due to thin provisioning, and e.g. qemu using
the image file. The fix is to do fiemap in small chunks, as said above.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9bd32daa9b9a..3bbf8703db2a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3826,6 +3826,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	lockend = round_up(start + len, inode->root->fs_info->sectorsize);
 	prev_extent_end = lockstart;
 
+	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
 	lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
@@ -4019,6 +4020,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 
 out_unlock:
 	unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);
 	btrfs_free_backref_share_ctx(backref_ctx);