mm/migrate: fix deadlock in migrate_pages_batch() on large folios

Message ID: 20240728154913.4023977-1-hsiangkao@linux.alibaba.com
State: New

Commit Message

Gao Xiang July 28, 2024, 3:49 p.m. UTC
Currently, migrate_pages_batch() can hold multiple folio locks at the
same time, taken in an arbitrary order.  Although folio_trylock() is
used to avoid deadlock, as commit 2ef7dbb26990 ("migrate_pages: try
migrate in batch asynchronously firstly") mentioned, try_split_folio()
was still missed.

This was found by a compaction stress test after I explicitly enabled
EROFS compressed files to use large folios; I cannot reproduce it with
the same workload when large folio support is off (current mainline).
Typically, filesystem reads (with locked file-backed folios) may use
another bdev/meta inode to issue further I/Os (e.g. for inode extent
metadata or cached compressed data), so the locking order will be:

  file-backed folios  (A)
     bdev/meta folios (B)

The following call traces show the deadlock:
   Thread 1 takes (B) lock and tries to take folio (A) lock
   Thread 2 takes (A) lock and tries to take folio (B) lock
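
In other words, the two threads reduce to a classic ABBA pattern.  A
minimal sketch (illustrative pseudo-code only, not the actual call
paths; those are in the traces below):

  /* (A) = file-backed folio, (B) = bdev/meta folio */

  /* Thread 1: batched migration via compaction */
  folio_lock(B);	/* meta folios already locked in unmap_folios */
  folio_lock(A);	/* try_split_folio(): blocks, (A) held by Thread 2 */

  /* Thread 2: EROFS readahead */
  folio_lock(A);	/* file-backed folios locked for readahead */
  folio_lock(B);	/* blocks, (B) held by Thread 1 */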

[Thread 1]
INFO: task stress:1824 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1824  tgid:1824  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
			<-- folio mapping ffff00036d69cb18 index 996  (**)
 __folio_lock+0x24/0x38
 migrate_pages_batch+0x77c/0xea0	// try_split_folio (mm/migrate.c:1486:2)
						// migrate_pages_batch (mm/migrate.c:1734:16)
		<--- LIST_HEAD(unmap_folios) has
			..
			folio mapping 0xffff0000d184f1d8 index 1711;   (*)
			folio mapping 0xffff0000d184f1d8 index 1712;
			..
 migrate_pages+0xb28/0xe90
 compact_zone+0xa08/0x10f0
 compact_node+0x9c/0x180
 sysctl_compaction_handler+0x8c/0x118
 proc_sys_call_handler+0x1a8/0x280
 proc_sys_write+0x1c/0x30
 vfs_write+0x240/0x380
 ksys_write+0x78/0x118
 __arm64_sys_write+0x24/0x38
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

[Thread 2]
INFO: task stress:1825 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1825  tgid:1825  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
			<-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
 __folio_lock+0x24/0x38
 z_erofs_runqueue+0x384/0x9c0 [erofs]
 z_erofs_readahead+0x21c/0x350 [erofs]       <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
 read_pages+0x74/0x328
 page_cache_ra_order+0x26c/0x348
 ondemand_readahead+0x1c0/0x3a0
 page_cache_sync_ra+0x9c/0xc0
 filemap_get_pages+0xc4/0x708
 filemap_read+0x104/0x3a8
 generic_file_read_iter+0x4c/0x150
 vfs_read+0x27c/0x330
 ksys_pread64+0x84/0xd0
 __arm64_sys_pread64+0x28/0x40
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 mm/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Andrew Morton July 28, 2024, 7:50 p.m. UTC | #1
On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> Currently, migrate_pages_batch() can hold multiple folio locks at the
> same time, taken in an arbitrary order.  Although folio_trylock() is
> used to avoid deadlock, as commit 2ef7dbb26990 ("migrate_pages: try
> migrate in batch asynchronously firstly") mentioned, try_split_folio()
> was still missed.

Am I correct in believing that folio_lock() doesn't have lockdep coverage?

> This was found by a compaction stress test after I explicitly enabled
> EROFS compressed files to use large folios; I cannot reproduce it with
> the same workload when large folio support is off (current mainline).
> Typically, filesystem reads (with locked file-backed folios) may use
> another bdev/meta inode to issue further I/Os (e.g. for inode extent
> metadata or cached compressed data), so the locking order will be:

Which kernels need fixing?  Do we expect that any code paths in 6.10 or
earlier are vulnerable to this?
Matthew Wilcox July 28, 2024, 9:17 p.m. UTC | #2
On Sun, Jul 28, 2024 at 12:50:05PM -0700, Andrew Morton wrote:
> On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > Currently, migrate_pages_batch() can hold multiple folio locks at the
> > same time, taken in an arbitrary order.  Although folio_trylock() is
> > used to avoid deadlock, as commit 2ef7dbb26990 ("migrate_pages: try
> > migrate in batch asynchronously firstly") mentioned, try_split_folio()
> > was still missed.
> 
> Am I correct in believing that folio_lock() doesn't have lockdep coverage?

Yes.  It can't; it is taken in process context and released by whatever
context the read completion happens in (could be hard/soft irq, could be
a workqueue, could be J. Random kthread, depending on the device driver).
So it doesn't match the lockdep model at all.
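
A minimal sketch of that split ownership (a hypothetical read path;
the function names are illustrative, not a real driver):

  /* completion context: hard/soft irq, workqueue, kthread, ... */
  static void read_end_io(struct bio *bio)
  {
  	struct folio *folio = bio_first_folio_all(bio);

  	folio_mark_uptodate(folio);
  	folio_unlock(folio);	/* released by a different "owner" */
  	bio_put(bio);
  }

  static void start_read(struct folio *folio, struct bio *bio)
  {
  	folio_lock(folio);	/* acquired in process context... */
  	bio->bi_end_io = read_end_io;
  	submit_bio(bio);	/* ...unlocked by whichever context
  				 * completes the bio */
  }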

> > This was found by a compaction stress test after I explicitly enabled
> > EROFS compressed files to use large folios; I cannot reproduce it with
> > the same workload when large folio support is off (current mainline).
> > Typically, filesystem reads (with locked file-backed folios) may use
> > another bdev/meta inode to issue further I/Os (e.g. for inode extent
> > metadata or cached compressed data), so the locking order will be:
> 
> Which kernels need fixing?  Do we expect that any code paths in 6.10 or
> earlier are vulnerable to this?

I would suggest it goes back to the introduction of large folios, but
that's just a gut feeling based on absolutely no reading of code or
inspection of git history.
Matthew Wilcox July 28, 2024, 9:46 p.m. UTC | #3
On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
> This was found by a compaction stress test after I explicitly enabled
> EROFS compressed files to use large folios; I cannot reproduce it with
> the same workload when large folio support is off (current mainline).
> Typically, filesystem reads (with locked file-backed folios) may use
> another bdev/meta inode to issue further I/Os (e.g. for inode extent
> metadata or cached compressed data), so the locking order will be:

Umm.  That is a new constraint to me.  We have two other places which
take the folio lock in a particular order.  Writeback takes locks on
folios belonging to the same inode in ascending ->index order.  It
submits all the folios for write before moving on to lock other inodes,
so it does not conflict with this new constraint you're proposing.

The other place is remap_file_range().  Both inodes in that case must be
regular files,
        if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
                return -EINVAL;
so this new rule is fine.

Does anybody know of any _other_ ordering constraints on folio locks?  I'm
willing to write them down ...

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a912e4b83228 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  {
>  	int rc;
>  
> -	folio_lock(folio);
> +	if (!folio_trylock(folio))
> +		return -EAGAIN;
>  	rc = split_folio_to_list(folio, split_folios);
>  	folio_unlock(folio);
>  	if (!rc)

This feels like the best quick fix to me since migration is going to
walk the folios in a different order from writeback.  I'm surprised
this hasn't already bitten us, to be honest.

(ie I don't think this is even necessarily connected to the new
ordering constraint; I think migration and writeback can already
deadlock)
Gao Xiang July 28, 2024, 10:11 p.m. UTC | #4
Hi,

On 2024/7/29 05:46, Matthew Wilcox wrote:
> On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
>> This was found by a compaction stress test after I explicitly enabled
>> EROFS compressed files to use large folios; I cannot reproduce it with
>> the same workload when large folio support is off (current mainline).
>> Typically, filesystem reads (with locked file-backed folios) may use
>> another bdev/meta inode to issue further I/Os (e.g. for inode extent
>> metadata or cached compressed data), so the locking order will be:
> 
> Umm.  That is a new constraint to me.  We have two other places which
> take the folio lock in a particular order.  Writeback takes locks on
> folios belonging to the same inode in ascending ->index order.  It
> submits all the folios for write before moving on to lock other inodes,
> so it does not conflict with this new constraint you're proposing.

BTW, I don't believe it's a new ordering introduced by EROFS; if you
consider ext4 or ext2 for example, they also use sb_bread() (buffer
heads on the bdev inode) to trigger some metadata I/Os,

e.g. take ext2 for simplicity:
   ext2_readahead
     mpage_readahead
      ext2_get_block
        ext2_get_blocks
          ext2_get_branch
             sb_bread     <-- get some metadata using for this data I/O

> 
> The other place is remap_file_range().  Both inodes in that case must be
> regular files,
>          if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>                  return -EINVAL;
> so this new rule is fine.
> 
> Does anybody know of any _other_ ordering constraints on folio locks?  I'm
> willing to write them down ...

Personally, I can't think of any particular ordering between two folio
locks across different inodes, so I think batched folio locking
always needs to be handled with care.

> 
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 20cb9f5f7446..a912e4b83228 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>   {
>>   	int rc;
>>   
>> -	folio_lock(folio);
>> +	if (!folio_trylock(folio))
>> +		return -EAGAIN;
>>   	rc = split_folio_to_list(folio, split_folios);
>>   	folio_unlock(folio);
>>   	if (!rc)
> 
> This feels like the best quick fix to me since migration is going to
> walk the folios in a different order from writeback.  I'm surprised
> this hasn't already bitten us, to be honest.

My stress workload explicitly triggers compaction alongside other EROFS
read loads; I'm not sure if others test like this too, but:
https://lore.kernel.org/r/20240418001356.95857-1-mcgrof@kernel.org

seems like a similar load.

Thanks,
Gao Xiang

> 
> (ie I don't think this is even necessarily connected to the new
> ordering constraint; I think migration and writeback can already
> deadlock)
Gao Xiang July 28, 2024, 10:35 p.m. UTC | #5
Hi,

On 2024/7/29 05:17, Matthew Wilcox wrote:
> On Sun, Jul 28, 2024 at 12:50:05PM -0700, Andrew Morton wrote:
>> On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>> Currently, migrate_pages_batch() can hold multiple folio locks at the
>>> same time, taken in an arbitrary order.  Although folio_trylock() is
>>> used to avoid deadlock, as commit 2ef7dbb26990 ("migrate_pages: try
>>> migrate in batch asynchronously firstly") mentioned, try_split_folio()
>>> was still missed.
>>
>> Am I correct in believing that folio_lock() doesn't have lockdep coverage?
> 
> Yes.  It can't; it is taken in process context and released by whatever
> context the read completion happens in (could be hard/soft irq, could be
> a workqueue, could be J. Random kthread, depending on the device driver).
> So it doesn't match the lockdep model at all.
> 
>>> This was found by a compaction stress test after I explicitly enabled
>>> EROFS compressed files to use large folios; I cannot reproduce it with
>>> the same workload when large folio support is off (current mainline).
>>> Typically, filesystem reads (with locked file-backed folios) may use
>>> another bdev/meta inode to issue further I/Os (e.g. for inode extent
>>> metadata or cached compressed data), so the locking order will be:
>>
>> Which kernels need fixing?  Do we expect that any code paths in 6.10 or
>> earlier are vulnerable to this?
> 
> I would suggest it goes back to the introduction of large folios, but
> that's just a gut feeling based on absolutely no reading of code or
> inspection of git history.

According to 5dfab109d519 ("migrate_pages: batch _unmap and _move"),
I think it's v6.3+.

Yet I don't have more time to look into the whole history of batched
migration; hoping Huang, Ying could give more hints on this.

Thanks,
Gao Xiang
Huang, Ying July 29, 2024, 1:38 a.m. UTC | #6
Hi, Xiang,

Gao Xiang <hsiangkao@linux.alibaba.com> writes:

> [...]
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a912e4b83228 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  {
>  	int rc;
>  
> -	folio_lock(folio);
> +	if (!folio_trylock(folio))
> +		return -EAGAIN;
>  	rc = split_folio_to_list(folio, split_folios);
>  	folio_unlock(folio);
>  	if (!rc)

Good catch!  Thanks for fixing this!

The deadlock is similar to the one we fixed in commit fb3592c41a44
("migrate_pages: fix deadlock in batched migration").  But apparently,
we missed this case.

For the fix, I think that we should still respect migrate_mode because
users may prefer migration success over blocking.

@@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 	return rc;
 }
 
-static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
+static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
+				  enum migrate_mode mode)
 {
 	int rc;
 
-	folio_lock(folio);
+	if (mode == MIGRATE_ASYNC) {
+		if (!folio_trylock(folio))
+			return -EAGAIN;
+	} else {
+		folio_lock(folio);
+	}
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
 	if (!rc)


--
Best Regards,
Huang, Ying
Gao Xiang July 29, 2024, 1:58 a.m. UTC | #7
Hi Ying,

On 2024/7/29 09:38, Huang, Ying wrote:
> Hi, Xiang,
> 
> Gao Xiang <hsiangkao@linux.alibaba.com> writes:
> 
>> [...]
> 
> Good catch!  Thanks for fixing this!
>
> The deadlock is similar to the one we fixed in commit fb3592c41a44
> ("migrate_pages: fix deadlock in batched migration").  But apparently,
> we missed this case.
> 
> For the fix, I think that we should still respect migrate_mode because
> users may prefer migration success over blocking.
> 
> @@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>   	return rc;
>   }
>   
> -static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
> +static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
> +				  enum migrate_mode mode)
>   {
>   	int rc;
>   
> -	folio_lock(folio);
> +	if (mode == MIGRATE_ASYNC) {
> +		if (!folio_trylock(folio))
> +			return -EAGAIN;
> +	} else {
> +		folio_lock(folio);
> +	}
>   	rc = split_folio_to_list(folio, split_folios);
>   	folio_unlock(folio);
>   	if (!rc)

Okay, yeah it looks better since it seems I missed the fallback
part in migrate_pages_sync().

Let me send the next version to follow your advice, thanks.

Thanks,
Gao Xiang


> 
> 
> --
> Best Regards,
> Huang, Ying
Gao Xiang Aug. 2, 2024, 9:01 a.m. UTC | #8
Hi Matthew,

On 2024/7/29 06:11, Gao Xiang wrote:
> Hi,
> 
> On 2024/7/29 05:46, Matthew Wilcox wrote:
>> On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
>>> This was found by a compaction stress test after I explicitly enabled
>>> EROFS compressed files to use large folios; I cannot reproduce it with
>>> the same workload when large folio support is off (current mainline).
>>> Typically, filesystem reads (with locked file-backed folios) may use
>>> another bdev/meta inode to issue further I/Os (e.g. for inode extent
>>> metadata or cached compressed data), so the locking order will be:
>>
>> Umm.  That is a new constraint to me.  We have two other places which
>> take the folio lock in a particular order.  Writeback takes locks on
>> folios belonging to the same inode in ascending ->index order.  It
>> submits all the folios for write before moving on to lock other inodes,
>> so it does not conflict with this new constraint you're proposing.
> 
> BTW, I don't believe it's a new ordering introduced by EROFS; if you
> consider ext4 or ext2 for example, they also use sb_bread() (buffer
> heads on the bdev inode) to trigger some metadata I/Os,
> 
> e.g. take ext2 for simplicity:
>    ext2_readahead
>      mpage_readahead
>       ext2_get_block
>         ext2_get_blocks
>           ext2_get_branch
>              sb_bread     <-- get some metadata using for this data I/O

I guess I need to write more words about this:

Although sb_bread() currently mainly takes buffer locks to do metadata
I/Os, the following path takes a similar dependency:

                ...
                sb_bread
                  __bread_gfp
                    bdev_getblk
                      __getblk_slow
                        grow_dev_folio  // bdev->bd_mapping
                          __filemap_get_folio(FGP_LOCK | .. | FGP_CREAT)

So the ordering has been there for decades.  Although EROFS hasn't used
buffer heads since its initial version, it needs a different
address_space to cache metadata in the page cache for best performance.

In .read_folio() and .readahead() contexts, the order has to be

   file-backed folios
      bdev/meta folios

since it's hard to use any other order: the file-backed folios cannot
be filled until the bdev/meta folios are uptodate.
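
A reduced sketch of that shape, with (A) the file-backed folio and (B)
the bdev/meta folio (a hypothetical filesystem; meta_mapping() and the
index are made up for illustration):

  static void example_fs_readahead(struct readahead_control *rac)
  {
  	struct folio *folio, *meta;

  	while ((folio = readahead_folio(rac))) {
  		/* (A) arrives locked from the readahead core */

  		/* read_mapping_folio() locks the meta folio (B)
  		 * internally while (A) is still held: the A -> B
  		 * ordering that batched locking must not invert */
  		meta = read_mapping_folio(meta_mapping(folio), 0, NULL);
  		if (IS_ERR(meta)) {
  			folio_unlock(folio);
  			continue;
  		}
  		/* ... use meta to locate data and fill (A) ... */
  		folio_put(meta);
  		folio_mark_uptodate(folio);
  		folio_unlock(folio);
  	}
  }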

> 
>>
>> The other place is remap_file_range().  Both inodes in that case must be
>> regular files,
>>          if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>>                  return -EINVAL;
>> so this new rule is fine.

Referring to vfs_dedupe_file_range_compare() and vfs_lock_two_folios(),
it seems they only consider folio->index regardless of address_spaces
too.

>>
>> Does anybody know of any _other_ ordering constraints on folio locks?  I'm
>> willing to write them down ...
> 
> Personally, I can't think of any particular ordering between two folio
> locks across different inodes, so I think batched folio locking
> always needs to be handled with care.


I think the folio_lock() comment about different address_spaces, added
in commit cd125eeab2de ("filemap: Update the folio_lock documentation"),
should be refined:

...
  * in the same address_space.  If they are in different address_spaces,
  * acquire the lock of the folio which belongs to the address_space which
  * has the lowest address in memory first.
  */
static inline void folio_lock(struct folio *folio)
{
...


There are several cases where we cannot follow the comment above, due
to .read_folio(), .readahead() and other contexts.

I'm not sure how to document the ordering between different
address_spaces, so I think it's just "no particular order between
different address_spaces"; see the suggestion below.

Thanks,
Gao Xiang
Andrew Morton Aug. 16, 2024, 5:02 a.m. UTC | #9
On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> > For the fix, I think that we should still respect migrate_mode because
> > users may prefer migration success over blocking.
> > 
> > [...]
> 
> Okay, yeah it looks better since it seems I missed the fallback
> part in migrate_pages_sync().
> 
> Let me send the next version to follow your advice, thanks.

The author seems to have disappeared.  Should we merge this as-is or
does someone want to take a look at developing a v2?

Thanks.
Gao Xiang Aug. 16, 2024, 5:12 a.m. UTC | #10
Hi Andrew,

On 2024/8/16 13:02, Andrew Morton wrote:
> On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> 
>>> For the fix, I think that we should still respect migrate_mode because
>>> users may prefer migration success over blocking.
>>>
>>> [...]
>>
>> Okay, yeah it looks better since it seems I missed the fallback
>> part in migrate_pages_sync().
>>
>> Let me send the next version to follow your advice, thanks.
> 
> The author seems to have disappeared.  Should we merge this as-is or
> does someone want to take a look at developing a v2?

I replied to your email last week; I'm not sure why it has not
been addressed:

https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/

The patch in your queue is already v2, no?

Thanks,
Gao Xiang
Gao Xiang Aug. 16, 2024, 5:17 a.m. UTC | #11
On 2024/8/16 13:12, Gao Xiang wrote:
> 
> Hi Andrew,
> 
> On 2024/8/16 13:02, Andrew Morton wrote:
>> On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>>> For the fix, I think that we should still respect migrate_mode because
>>>> users may prefer migration success over blocking.
>>>>
>>>> [...]
>>>
>>> Okay, yeah it looks better since it seems I missed the fallback
>>> part in migrate_pages_sync().
>>>
>>> Let me send the next version to follow your advice, thanks.
>>
>> The author seems to have disappeared.  Should we merge this as-is or
>> does someone want to take a look at developing a v2?
> 
> I replied to your email last week; I'm not sure why it has not
> been addressed:
> 
> https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/
> 
> The patch in your queue is already v2, no?

Really confused about this, since the comment above was about v1,
and v2 was already sent (on July 29) and has been in -next for two weeks
with

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>

What else do I need to do to resolve this already-resolved comment
so that I can enable large folios without deadlocks?

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
Andrew Morton Aug. 16, 2024, 5:25 a.m. UTC | #12
On Fri, 16 Aug 2024 13:17:04 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> >> does someone want to take a look at developing a v2?
> > 
> > I replied to your email last week; I'm not sure why it has not
> > been addressed:
> > 
> > https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/
> > 
> > The patch in your queue is already v2, no?
> 
> Really confused about this, since the comment above was about v1,
> and v2 was already sent (on July 29) and has been in -next for two weeks
> with
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> What else do I need to do to resolve this already-resolved comment
> so that I can enable large folios without deadlocks?

Is cool, thanks - I failed to correctly update my notes-to-self.
Gao Xiang Aug. 16, 2024, 5:32 a.m. UTC | #13
On 2024/8/16 13:25, Andrew Morton wrote:
> On Fri, 16 Aug 2024 13:17:04 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> 
>>>> does someone want to take a look at developing a v2?
>>>
>>> I replied to your email last week; I'm not sure why it has not
>>> been addressed:
>>>
>>> https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/
>>>
>>> The patch in your queue is already v2, no?
>>
>> Really confused about this, since the comment above was about v1,
>> and v2 was already sent (on July 29) and has been in -next for two weeks
>> with
>>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>>
>> What else do I need to do to resolve this already-resolved comment
>> so that I can enable large folios without deadlocks?
> 
> Is cool, thanks - I failed to correctly update my notes-to-self.

Many thanks for the confirmation; hopefully it can be upstreamed
so I can enable large folios safely.

Thanks,
Gao Xiang
Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index 20cb9f5f7446..a912e4b83228 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1483,7 +1483,8 @@  static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 {
 	int rc;
 
-	folio_lock(folio);
+	if (!folio_trylock(folio))
+		return -EAGAIN;
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
 	if (!rc)