| Message ID | 20250113022426.703537-1-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | loop: don't call vfs_flush() with queue frozen |
On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> If vfs_flush() is called with queue frozen, the queue freeze lock may be
> connected with FS internal lock

What "FS internal lock"?

> , and potential deadlock could be
> triggered.
>
> Fix it by moving vfs_flush() out of queue freezing.

That doesn't work. The pagecache will be dirtied by any command
processed using buffered I/O, so we need to freeze first to ensure
that there are no outstanding commands.
On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > connected with FS internal lock
>
> What "FS internal lock"?

Please see the report:

https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u

> > , and potential deadlock could be
> > triggered.
> >
> > Fix it by moving vfs_flush() out of queue freezing.
>
> That doesn't work. The pagecache will be dirtied by any command
> processed using buffered I/O, so we need to freeze first to ensure
> that there are no outstanding commands.

vfs_flush() is called in case of previous buffered IO, for flushing dirty
pages.

The call from loop_change_fd() has been broken, because ->lo_backing_file
is updated to the new one when calling loop_update_dio(), and I will cover
this case in V2.

For the others, in which the backing file isn't changed, the patch is just
fine in case of new buffered IO mode, isn't it?

Thanks,
Ming
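For readers following the argument, here is a condensed sketch of the
loop_change_fd() ordering being described. It is not the literal
drivers/block/loop.c code; the field and helper names mirror the driver,
but the body is simplified purely for illustration.

static int loop_change_fd_sketch(struct loop_device *lo, struct file *file)
{
        struct file *old_file = lo->lo_backing_file;

        blk_mq_freeze_queue(lo->lo_queue);

        /* the new backing file is installed first ... */
        lo->lo_backing_file = file;

        /*
         * ... so a vfs_fsync(lo->lo_backing_file, 0) issued from
         * loop_update_dio() now targets the *new* file, while the dirty
         * pages from earlier buffered IO still sit in old_file's mapping.
         * That is the breakage to be addressed in V2.
         */
        loop_update_dio(lo);

        blk_mq_unfreeze_queue(lo->lo_queue);
        fput(old_file);
        return 0;
}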
On Mon, Jan 13, 2025 at 04:22:51PM +0800, Ming Lei wrote:
> On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > > connected with FS internal lock
> >
> > What "FS internal lock"?
>
> Please see the report:
>
> https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u

Please state the locks. Nothing fs internal here, that report is
about i_rwsem. And it is a false positive, because it is about the ordering
of i_rwsem on the upper file system sitting on the loop device vs the
one on the lower file system sitting below the block device. These
obviously can't deadlock, we just need to tell lockdep about that fact.
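Purely to illustrate the "tell lockdep about that fact" remark, one
conceivable annotation would give the backing file's i_rwsem its own
lockdep class so it is no longer conflated with the upper filesystem's
i_rwsem. This is a hypothetical sketch with invented names, not something
the loop driver currently does.

static struct lock_class_key loop_backing_i_rwsem_key;  /* hypothetical */

static void loop_annotate_backing_inode(struct file *backing_file)
{
        struct inode *inode = file_inode(backing_file);

        /*
         * Put the lower filesystem's i_rwsem in a separate lockdep class,
         * so lock chains through the loop device are not reported against
         * the upper filesystem's i_rwsem class.
         */
        lockdep_set_class(&inode->i_rwsem, &loop_backing_i_rwsem_key);
}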
On Mon, Jan 13, 2025 at 01:18:00AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 04:22:51PM +0800, Ming Lei wrote:
> > On Sun, Jan 12, 2025 at 09:49:39PM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 13, 2025 at 10:24:26AM +0800, Ming Lei wrote:
> > > > If vfs_flush() is called with queue frozen, the queue freeze lock may be
> > > > connected with FS internal lock
> > >
> > > What "FS internal lock"?
> >
> > Please see the report:
> >
> > https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u
>
> Please state the locks. Nothing fs internal here, that report is
> about i_rwsem. And it is a false positive, because it is about the ordering
> of i_rwsem on the upper file system sitting on the loop device vs the
> one on the lower file system sitting below the block device. These
> obviously can't deadlock, we just need to tell lockdep about that fact.

How can you guarantee that no code will submit IO while holding the
i_rwsem?

As I explained, it is fine to move vfs_fsync() out of queue freezing.

Actually any lock which depends on the queue freeze lock needs a careful
look, because queue freezing connects too many global/sub-system locks.

Thanks,
Ming
On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > Please state the locks. Nothing fs internal here, that report is
> > about i_rwsem. And it is a false positive, because it is about the ordering
> > of i_rwsem on the upper file system sitting on the loop device vs the
> > one on the lower file system sitting below the block device. These
> > obviously can't deadlock, we just need to tell lockdep about that fact.
>
> How can you guarantee that no code will submit IO while holding the
> i_rwsem?

? A lot of the I/O will grab i_rwsem on the underlying device.
Basically all writes, and for many file systems also reads. But
that is an entirely different i_rwsem from the one held by the bio
submitter, as that one is in a different file system. There is no way
the top file system can lock i_rwsem on the lower file system except
through the loop driver, and that always sits below the freeze
protection.

> As I explained, it is fine to move vfs_fsync() out of queue freezing.
>
> Actually any lock which depends on the queue freeze lock needs a careful
> look, because queue freezing connects too many global/sub-system locks.

For block layer locks: absolutely. For file system locks: not at all,
because we're talking about different file system instances. The only
exception would be file systems taking global locks in the I/O path,
but I sincerely hope no one does that.
On Mon, Jan 13, 2025 at 05:49:43AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > > Please state the locks. Nothing fs internal here, that report is
> > > about i_rwsem. And it is a false positive, because it is about the ordering
> > > of i_rwsem on the upper file system sitting on the loop device vs the
> > > one on the lower file system sitting below the block device. These
> > > obviously can't deadlock, we just need to tell lockdep about that fact.
> >
> > How can you guarantee that no code will submit IO while holding the
> > i_rwsem?
>
> ? A lot of the I/O will grab i_rwsem on the underlying device.
> Basically all writes, and for many file systems also reads. But
> that is an entirely different i_rwsem from the one held by the bio
> submitter, as that one is in a different file system. There is no way
> the top file system can lock i_rwsem on the lower file system except
> through the loop driver, and that always sits below the freeze
> protection.
>
> > As I explained, it is fine to move vfs_fsync() out of queue freezing.
> >
> > Actually any lock which depends on the queue freeze lock needs a careful
> > look, because queue freezing connects too many global/sub-system locks.
>
> For block layer locks: absolutely. For file system locks: not at all,
> because we're talking about different file system instances. The only
> exception would be file systems taking global locks in the I/O path,
> but I sincerely hope no one does that.

Didn't you see the report on fs_reclaim and the sysfs root lock?

https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/

Thanks,
Ming
> If vfs_flush() is called with queue frozen, the queue freeze lock may be
> connected with FS internal lock, and potential deadlock could be
> triggered.
>
> Fix it by moving vfs_flush() out of queue freezing.
>
> Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
> Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/loop.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> [...]

Hello Ming,

Should we test the fixed code through multiple rounds? Is there a
conclusion to the ongoing discussion about the patch?

——
Thanks,
Kun Hu
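On the testing question, a minimal userspace sketch that repeatedly hits
the LOOP_SET_BLOCK_SIZE path (one of the ioctls whose freeze/flush
ordering the patch changes) might look like the following. The device
path, iteration count, and block sizes are placeholders; it assumes a
loop device is already bound while a filesystem on top of it is doing
buffered IO.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
        /* placeholder device; adjust to the bound loop device under test */
        int fd = open("/dev/loop0", O_RDWR);

        if (fd < 0) {
                perror("open /dev/loop0");
                return 1;
        }

        /* toggle the block size so loop_set_block_size() runs repeatedly */
        for (int i = 0; i < 1000; i++) {
                unsigned long bs = (i & 1) ? 4096 : 512;

                if (ioctl(fd, LOOP_SET_BLOCK_SIZE, bs) < 0)
                        perror("LOOP_SET_BLOCK_SIZE");
        }

        close(fd);
        return 0;
}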
On Mon, Jan 13, 2025 at 11:07:42PM +0800, Ming Lei wrote:
> On Mon, Jan 13, 2025 at 05:49:43AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2025 at 05:24:46PM +0800, Ming Lei wrote:
> > > > Please state the locks. Nothing fs internal here, that report is
> > > > about i_rwsem. And it is a false positive, because it is about the ordering
> > > > of i_rwsem on the upper file system sitting on the loop device vs the
> > > > one on the lower file system sitting below the block device. These
> > > > obviously can't deadlock, we just need to tell lockdep about that fact.
> > >
> > > How can you guarantee that no code will submit IO while holding the
> > > i_rwsem?
> >
> > ? A lot of the I/O will grab i_rwsem on the underlying device.
> > Basically all writes, and for many file systems also reads. But
> > that is an entirely different i_rwsem from the one held by the bio
> > submitter, as that one is in a different file system. There is no way
> > the top file system can lock i_rwsem on the lower file system except
> > through the loop driver, and that always sits below the freeze protection.

Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
could run into a real deadlock if IO on the loop disk is caused by the
kmalloc(GFP_KERNEL). So it is not a false positive.

> >
> > > As I explained, it is fine to move vfs_fsync() out of queue freezing.
> > >
> > > Actually any lock which depends on the queue freeze lock needs a careful
> > > look, because queue freezing connects too many global/sub-system locks.
> >
> > For block layer locks: absolutely. For file system locks: not at all,
> > because we're talking about different file system instances. The only
> > exception would be file systems taking global locks in the I/O path,
> > but I sincerely hope no one does that.
>
> Didn't you see the report on fs_reclaim and the sysfs root lock?
>
> https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/

There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...

[1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
[2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/

Thanks,
Ming
On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> could run into a real deadlock if IO on the loop disk is caused by the
> kmalloc(GFP_KERNEL).

Well, loop can always deadlock when the lower fs is doing allocations,
even without the freeze. I'm actually kinda surprised loop doesn't
force a noio context, as we'd really need that.

> > > because we're talking about different file system instances. The only
> > > exception would be file systems taking global locks in the I/O path,
> > > but I sincerely hope no one does that.
> >
> > Didn't you see the report on fs_reclaim and the sysfs root lock?
> >
> > https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/
>
> There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...
>
> [1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
> [2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/

And all of these are caused by not breaking the dependency loop..
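For reference, the "force a noio context" idea maps onto the kernel's
scoped allocation-context API. The sketch below only illustrates that
mechanism; it is not what the loop driver does today, and the function
name is invented.

#include <linux/sched/mm.h>

static void loop_process_cmd_noio_sketch(void)
{
        unsigned int noio_flags;

        /*
         * Everything allocated between save and restore implicitly behaves
         * as GFP_NOIO, covering allocations that mapping_set_gfp_mask() on
         * the backing file's mapping cannot reach.
         */
        noio_flags = memalloc_noio_save();

        /* ... process the loop command, issue buffered IO, etc. ... */

        memalloc_noio_restore(noio_flags);
}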
On Tue, Jan 14, 2025 at 11:05:00PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> > Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> > could run into a real deadlock if IO on the loop disk is caused by the
> > kmalloc(GFP_KERNEL).
>
> Well, loop can always deadlock when the lower fs is doing allocations,
> even without the freeze. I'm actually kinda surprised loop doesn't
> force a noio context, as we'd really need that.

Loop does call mapping_set_gfp_mask(~(__GFP_IO|__GFP_FS)).

> > > > because we're talking about different file system instances. The only
> > > > exception would be file systems taking global locks in the I/O path,
> > > > but I sincerely hope no one does that.
> > >
> > > Didn't you see the report on fs_reclaim and the sysfs root lock?
> > >
> > > https://lore.kernel.org/linux-block/197b07435a736825ab40dab8d91db031c7fce37e.camel@linux.intel.com/
> >
> > There are more, such as mm->mmap_lock[1], the hfs SB 'cat_tree' lock[2]...
> >
> > [1] https://lore.kernel.org/linux-block/67863050.050a0220.216c54.006f.GAE@google.com/
> > [2] https://lore.kernel.org/linux-block/67582202.050a0220.a30f1.01cb.GAE@google.com/
>
> And all of these are caused by not breaking the dependency loop..

Can you share how to break the dependency? Such as fs_reclaim &
mm->mmap_lock.

Thanks,
Ming
On Wed, Jan 15, 2025 at 04:21:40PM +0800, Ming Lei wrote:
> On Tue, Jan 14, 2025 at 11:05:00PM -0800, Christoph Hellwig wrote:
> > On Wed, Jan 15, 2025 at 10:37:08AM +0800, Ming Lei wrote:
> > > Actually some FSs may call kmalloc(GFP_KERNEL) with i_rwsem held, which
> > > could run into a real deadlock if IO on the loop disk is caused by the
> > > kmalloc(GFP_KERNEL).
> >
> > Well, loop can always deadlock when the lower fs is doing allocations,
> > even without the freeze. I'm actually kinda surprised loop doesn't
> > force a noio context, as we'd really need that.
>
> Loop does call mapping_set_gfp_mask(~(__GFP_IO|__GFP_FS)).

But that only helps with page cache allocations, not with other allocations
done in the I/O path, of which there are plenty in modern file systems.

> > And all of these are caused by not breaking the dependency loop..
>
> Can you share how to break the dependency? Such as fs_reclaim &
> mm->mmap_lock.

fs_reclaim should not be broken. It's a fake lockdep key annotating the
reclaim path. Everything under it needs to be NOFS, or in the case of loop
NOIO (see above).

I think mmap_lock is just an indirect chain here, but I need to take a
closer look. A little busy this week and only sporadically online, sorry.
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1ec7417c7f00..9adf496b3f93 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -203,7 +203,7 @@ static bool lo_can_use_dio(struct loop_device *lo)
  * loop_get_status will always report the effective LO_FLAGS_DIRECT_IO flag and
  * not the originally passed in one.
  */
-static inline void loop_update_dio(struct loop_device *lo)
+static inline bool loop_update_dio(struct loop_device *lo)
 {
         bool dio_in_use = lo->lo_flags & LO_FLAGS_DIRECT_IO;
 
@@ -217,8 +217,7 @@ static inline void loop_update_dio(struct loop_device *lo)
                 lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
 
         /* flush dirty pages before starting to issue direct I/O */
-        if ((lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use)
-                vfs_fsync(lo->lo_backing_file, 0);
+        return (lo->lo_flags & LO_FLAGS_DIRECT_IO) && !dio_in_use;
 }
 
 /**
@@ -589,6 +588,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
         int error;
         bool partscan;
         bool is_loop;
+        bool flush;
 
         if (!file)
                 return -EBADF;
@@ -629,11 +629,14 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
         lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping);
         mapping_set_gfp_mask(file->f_mapping,
                              lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
         blk_mq_unfreeze_queue(lo->lo_queue);
         partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
         loop_global_unlock(lo, is_loop);
 
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
+
         /*
          * Flush loop_validate_file() before fput(), for l->lo_backing_file
          * might be pointing at old_file which might be the last reference.
@@ -1255,6 +1258,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         int err;
         bool partscan = false;
         bool size_changed = false;
+        bool flush = false;
 
         err = mutex_lock_killable(&lo->lo_mutex);
         if (err)
@@ -1292,7 +1296,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         }
 
         /* update the direct I/O flag if lo_offset changed */
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
 
 out_unfreeze:
         blk_mq_unfreeze_queue(lo->lo_queue);
@@ -1302,6 +1306,8 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
         mutex_unlock(&lo->lo_mutex);
         if (partscan)
                 loop_reread_partitions(lo);
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
 
         return err;
 }
@@ -1473,6 +1479,7 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 {
         struct queue_limits lim;
         int err = 0;
+        bool flush;
 
         if (lo->lo_state != Lo_bound)
                 return -ENXIO;
@@ -1488,8 +1495,10 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 
         blk_mq_freeze_queue(lo->lo_queue);
         err = queue_limits_commit_update(lo->lo_queue, &lim);
-        loop_update_dio(lo);
+        flush = loop_update_dio(lo);
         blk_mq_unfreeze_queue(lo->lo_queue);
+        if (flush)
+                vfs_fsync(lo->lo_backing_file, 0);
 
         return err;
 }
If vfs_flush() is called with queue frozen, the queue freeze lock may be
connected with FS internal lock, and potential deadlock could be
triggered.

Fix it by moving vfs_flush() out of queue freezing.

Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)