| Message ID | 20250113022426.703537-1-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | loop: don't call vfs_flush() with queue frozen |
On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> If vfs_flush() is called with queue frozen, the queue freeze lock may be
> connected with FS internal lock

What "FS internal lock"?

> , and potential deadlock could be
> triggered.
>
> Fix it by moving vfs_flush() out of queue freezing.

That doesn't work. The pagecache will be dirtied by any command
processed using buffered I/O, so we need to freeze first to ensure
that there are no outstanding commands.
On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > connected with FS internal lock
>
> What "FS internal lock"?

Please see the report:

https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u

> > , and potential deadlock could be
> > triggered.
> >
> > Fix it by moving vfs_flush() out of queue freezing.
>
> That doesn't work. The pagecache will be dirtied by any command
> processed using buffered I/O, so we need to freeze first to ensure
> that there are no outstanding commands.

vfs_flush() is called in case of previous buffered IO, for flushing dirty
pages.

The call from loop_change_fd() has been broken, because ->lo_backing_file
is updated to the new one when calling loop_update_dio(), and I will cover
this case in V2.

For the others, in which the backing file isn't changed, the patch is just
fine in case of new buffered IO mode, isn't it?

Thanks,
Ming
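For readers following the argument, here is a condensed sketch of the
loop_change_fd() ordering being described. It is not the literal
drivers/block/loop.c code; the field and helper names mirror the driver,
but the body is simplified purely for illustration.

static int loop_change_fd_sketch(struct loop_device *lo, struct file *file)
{
        struct file *old_file = lo->lo_backing_file;

        blk_mq_freeze_queue(lo->lo_queue);

        /* the new backing file is installed first ... */
        lo->lo_backing_file = file;

        /*
         * ... so a vfs_fsync(lo->lo_backing_file, 0) issued from
         * loop_update_dio() now targets the *new* file, while the dirty
         * pages from earlier buffered IO still sit in old_file's mapping.
         * That is the breakage to be addressed in V2.
         */
        loop_update_dio(lo);

        blk_mq_unfreeze_queue(lo->lo_queue);
        fput(old_file);
        return 0;
}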
On Mon, Jan 13, 2025 at 04:22:51PM +0800, Ming Lei wrote:
> On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > > connected with FS internal lock
> >
> > What "FS internal lock"?
>
> Please see the report:
>
> https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u

Please state the locks. Nothing fs internal here, that report is
about i_rwsem. And it is a false positive, because it is about the ordering
of i_rwsem on the upper file system sitting on the loop device vs the
one on the lower file system sitting below the block device. These
obviously can't deadlock, we just need to tell lockdep about that fact.
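Purely to illustrate the "tell lockdep about that fact" remark, one
conceivable annotation would give the backing file's i_rwsem its own
lockdep class so it is no longer conflated with the upper filesystem's
i_rwsem. This is a hypothetical sketch with invented names, not something
the loop driver currently does.

static struct lock_class_key loop_backing_i_rwsem_key;  /* hypothetical */

static void loop_annotate_backing_inode(struct file *backing_file)
{
        struct inode *inode = file_inode(backing_file);

        /*
         * Put the lower filesystem's i_rwsem in a separate lockdep class,
         * so lock chains through the loop device are not reported against
         * the upper filesystem's i_rwsem class.
         */
        lockdep_set_class(&inode->i_rwsem, &loop_backing_i_rwsem_key);
}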
On Mon, Jan 13, 2025 at 01:18:00AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 04:22:51PM +0800, Ming Lei wrote:
> > On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > > > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > > > connected with FS internal lock
> > >
> > > What "FS internal lock"?
> >
> > Please see the report:
> >
> > https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u
>
> Please state the locks. Nothing fs internal here, that report is
> about i_rwsem. And it is a false positive, because it is about the ordering
> of i_rwsem on the upper file system sitting on the loop device vs the
> one on the lower file system sitting below the block device. These
> obviously can't deadlock, we just need to tell lockdep about that fact.

How can you guarantee that no code will submit IO while holding the
i_rwsem?

As I explained, it is fine to move vfs_fsync() out of queue freezing.

Actually any lock which depends on the queue freeze lock needs a careful
look, because queue freezing connects too many global/sub-system locks.

Thanks,
Ming
On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > Please state the locks. Nothing fs internal here, that report is
> > about i_rwsem. And it is a false positive, because it is about the ordering
> > of i_rwsem on the upper file system sitting on the loop device vs the
> > one on the lower file system sitting below the block device. These
> > obviously can't deadlock, we just need to tell lockdep about that fact.
>
> How can you guarantee that no code will submit IO while holding the
> i_rwsem?

? A lot of the I/O will grab i_rwsem on the underlying device.
Basically all writes, and for many file systems also reads. But
that is an entirely different i_rwsem from the one held by the bio
submitter, as that one is in a different file system. There is no way
the top file system can lock i_rwsem on the lower file system except
through the loop driver, and that always sits below the freeze
protection.

> As I explained, it is fine to move vfs_fsync() out of queue freezing.
>
> Actually any lock which depends on the queue freeze lock needs a careful
> look, because queue freezing connects too many global/sub-system locks.

For block layer locks: absolutely. For file system locks: not at all,
because we're talking about different file system instances. The only
exception would be file systems taking global locks in the I/O path,
but I sincerely hope no one does that.
On Mon, Jan 13, 2025 at 05:49:43AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > > Please state the locks. Nothing fs internal here, that report is
> > > about i_rwsem. And it is a false positive, because it is about the ordering
> > > of i_rwsem on the upper file system sitting on the loop device vs the
> > > one on the lower file system sitting below the block device. These
> > > obviously can't deadlock, we just need to tell lockdep about that fact.
> >
> > How can you guarantee that no code will submit IO while holding the
> > i_rwsem?
>
> ? A lot of the I/O will grab i_rwsem on the underlying device.
> Basically all writes, and for many file systems also reads. But
> that is an entirely different i_rwsem from the one held by the bio
> submitter, as that one is in a different file system. There is no way
> the top file system can lock i_rwsem on the lower file system except
> through the loop driver, and that always sits below the freeze
> protection.
>
> > As I explained, it is fine to move vfs_fsync() out of queue freezing.
> >
> > Actually any lock which depends on the queue freeze lock needs a careful
> > look, because queue freezing connects too many global/sub-system locks.
>
> For block layer locks: absolutely. For file system locks: not at all,
> because we're talking about different file system instances. The only
> exception would be file systems taking global locks in the I/O path,
> but I sincerely hope no one does that.

Didn't you see the report on fs_reclaim and the sysfs root lock?

https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/

Thanks,
Ming
> If vfs_flush() is called with queue frozen, the queue freeze lock may be
> connected with FS internal lock, and potential deadlock could be
> triggered.
>
> Fix it by moving vfs_flush() out of queue freezing.
>
> Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
> Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/loop.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> [...]

Hello Ming,

Should we test the fixed code through multiple rounds? Is there a
conclusion to the ongoing discussion about the patch?

——
Thanks,
Kun Hu
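On the testing question, a minimal userspace sketch that repeatedly hits
the LOOP_SET_BLOCK_SIZE path (one of the ioctls whose freeze/flush
ordering the patch changes) might look like the following. The device
path, iteration count, and block sizes are placeholders; it assumes a
loop device is already bound while a filesystem on top of it is doing
buffered IO.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
        /* placeholder device; adjust to the bound loop device under test */
        int fd = open("/dev/loop0", O_RDWR);

        if (fd < 0) {
                perror("open /dev/loop0");
                return 1;
        }

        /* toggle the block size so loop_set_block_size() runs repeatedly */
        for (int i = 0; i < 1000; i++) {
                unsigned long bs = (i & 1) ? 4096 : 512;

                if (ioctl(fd, LOOP_SET_BLOCK_SIZE, bs) < 0)
                        perror("LOOP_SET_BLOCK_SIZE");
        }

        close(fd);
        return 0;
}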
On Mon, Jan 13, 2025 at 11:07:42PM +0800, Ming Lei wrote:
> On Mon, Jan 13, 2025 at 05:49:43AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > > > Please state the locks. Nothing fs internal here, that report is
> > > > about i_rwsem. And it is a false positive, because it is about the ordering
> > > > of i_rwsem on the upper file system sitting on the loop device vs the
> > > > one on the lower file system sitting below the block device. These
> > > > obviously can't deadlock, we just need to tell lockdep about that fact.
> > >
> > > How can you guarantee that no code will submit IO while holding the
> > > i_rwsem?
> >
> > ? A lot of the I/O will grab i_rwsem on the underlying device.
> > Basically all writes, and for many file systems also reads. But
> > that is an entirely different i_rwsem from the one held by the bio
> > submitter, as that one is in a different file system. There is no way
> > the top file system can lock i_rwsem on the lower file system except
> > through the loop driver, and that always sits below the freeze protection.

Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
could run into a real deadlock if IO on the loop disk is caused by the
kmalloc(GFP_KERNEL). So it is not a false positive.

> >
> > > As I explained, it is fine to move vfs_fsync() out of queue freezing.
> > >
> > > Actually any lock which depends on the queue freeze lock needs a careful
> > > look, because queue freezing connects too many global/sub-system locks.
> >
> > For block layer locks: absolutely. For file system locks: not at all,
> > because we're talking about different file system instances. The only
> > exception would be file systems taking global locks in the I/O path,
> > but I sincerely hope no one does that.
>
> Didn't you see the report on fs_reclaim and the sysfs root lock?
>
> https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/

There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...

[1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
[2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/

Thanks,
Ming
On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> could run into a real deadlock if IO on the loop disk is caused by the
> kmalloc(GFP_KERNEL).

Well, loop can always deadlock when the lower fs is doing allocations,
even without the freeze. I'm actually kinda surprised loop doesn't
force a noio context, as we'd really need that.

> > > because we're talking about different file system instances. The only
> > > exception would be file systems taking global locks in the I/O path,
> > > but I sincerely hope no one does that.
> >
> > Didn't you see the report on fs_reclaim and the sysfs root lock?
> >
> > https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/
>
> There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...
>
> [1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
> [2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/

And all of these are caused by not breaking the dependency loop..
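For reference, the "force a noio context" idea maps onto the kernel's
scoped allocation-context API. The sketch below only illustrates that
mechanism; it is not what the loop driver does today, and the function
name is invented.

#include <linux/sched/mm.h>

static void loop_process_cmd_noio_sketch(void)
{
        unsigned int noio_flags;

        /*
         * Everything allocated between save and restore implicitly behaves
         * as GFP_NOIO, covering allocations that mapping_set_gfp_mask() on
         * the backing file's mapping cannot reach.
         */
        noio_flags = memalloc_noio_save();

        /* ... process the loop command, issue buffered IO, etc. ... */

        memalloc_noio_restore(noio_flags);
}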
On Tue, Jan 14, 2025 at 11:05:00PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> > Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> > could run into a real deadlock if IO on the loop disk is caused by the
> > kmalloc(GFP_KERNEL).
>
> Well, loop can always deadlock when the lower fs is doing allocations,
> even without the freeze. I'm actually kinda surprised loop doesn't
> force a noio context, as we'd really need that.

Loop does call mapping_set_gfp_mask(~(__GFP_IO|__GFP_FS)).

> > > > because we're talking about different file system instances. The only
> > > > exception would be file systems taking global locks in the I/O path,
> > > > but I sincerely hope no one does that.
> > >
> > > Didn't you see the report on fs_reclaim and the sysfs root lock?
> > >
> > > https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/
> >
> > There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...
> >
> > [1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
> > [2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/
>
> And all of these are caused by not breaking the dependency loop..

Can you share how to break the dependency? Such as fs_reclaim &
mm->mmap_lock.

Thanks,
Ming
On Wed, Jan 15, 2025 at 04:21:40PM +0800, Ming Lei wrote:
> On Tue, Jan 14, 2025 at 11:05:00PM -0800, Christoph Hellwig wrote:
> > On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> > > Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> > > could run into a real deadlock if IO on the loop disk is caused by the
> > > kmalloc(GFP_KERNEL).
> >
> > Well, loop can always deadlock when the lower fs is doing allocations,
> > even without the freeze. I'm actually kinda surprised loop doesn't
> > force a noio context, as we'd really need that.
>
> Loop does call mapping_set_gfp_mask(~(__GFP_IO|__GFP_FS)).

But that only helps with page cache allocations, not with other allocations
done in the I/O path, of which there are plenty in modern file systems.

> > And all of these are caused by not breaking the dependency loop..
>
> Can you share how to break the dependency? Such as fs_reclaim &
> mm->mmap_lock.

fs_reclaim should not be broken. It's a fake lockdep key annotating the
reclaim path. Everything under it needs to be NOFS, or in the case of loop
NOIO (see above).

I think mmap_lock is just an indirect chain here, but I need to take a
closer look. A little busy this week and only sporadically online, sorry.
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1ec7417c7f00..9adf496b3f93 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -203,7 +203,7 @@ static bool lo_can_use_dio(struct loop_device *lo)
  * loop_get_status will always report the effective LO_FLAGS_DIRECT_IO flag and
  * not the originally passed in one.
  */
-static inline void loop_update_dio(struct loop_device *lo)
+static inline bool loop_update_dio(struct loop_device *lo)
 {
         bool dio_in_use = lo->lo_flags & LO_FLAGS_DIRECT_IO;
 
@@ -217,8 +217,7 @@ static inline void loop_update_dio(struct loop_device *lo)
                 lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
 
         /* flush dirty pages before starting to issue direct I/O */
-        if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use)
-                vfs_fsync(lo->lo_backing_file, 0);
+        return (lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use;
 }
 
 /**
@@ -589,6 +588,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
         int error;
         bool partscan;
         bool is_loop;
+        bool flush;
 
         if (!file)
                 return -EBADF;
@@ -629,11 +629,14 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
         lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping);
         mapping_set_gfp_mask(file->f_mapping,
                              lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
         blk_mq_unfreeze_queue(lo->lo_queue);
         partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
         loop_global_unlock(lo, is_loop);
 
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
+
         /*
          * Flush loop_validate_file() before fput(), for l->lo_backing_file
          * might be pointing at old_file which might be the last reference.
@@ -1255,6 +1258,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         int err;
         bool partscan = false;
         bool size_changed = false;
+        bool flush = false;
 
         err = mutex_lock_killable(&lo->lo_mutex);
         if (err)
@@ -1292,7 +1296,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         }
 
         /* update the direct I/O flag if lo_offset changed */
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
 
 out_unfreeze:
         blk_mq_unfreeze_queue(lo->lo_queue);
@@ -1302,6 +1306,8 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         mutex_unlock(&lo->lo_mutex);
         if (partscan)
                 loop_reread_partitions(lo);
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
 
         return err;
 }
@@ -1473,6 +1479,7 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 {
         struct queue_limits lim;
         int err = 0;
+        bool flush;
 
         if (lo->lo_state != Lo_bound)
                 return -ENXIO;
@@ -1488,8 +1495,10 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 
         blk_mq_freeze_queue(lo->lo_queue);
         err = queue_limits_commit_update(lo->lo_queue, &lim);
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
         blk_mq_unfreeze_queue(lo->lo_queue);
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
 
         return err;
 }
If vfs_flush() is called with queue frozen, the queue freeze lock may be
connected with FS internal lock, and potential deadlock could be
triggered.

Fix it by moving vfs_flush() out of queue freezing.

Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)