Message ID: 20240728154913.4023977-1-hsiangkao@linux.alibaba.com (mailing list archive)
State: New
Series: mm/migrate: fix deadlock in migrate_pages_batch() on large folios
On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> Currently, migrate_pages_batch() can lock multiple locked folios
> with an arbitrary order. Although folio_trylock() is used to avoid
> deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
> asynchronously firstly") mentioned, it seems try_split_folio() is
> still missing.

Am I correct in believing that folio_lock() doesn't have lockdep coverage?

> It was found by compaction stress test when I explicitly enable EROFS
> compressed files to use large folios, which case I cannot reproduce with
> the same workload if large folio support is off (current mainline).
> Typically, filesystem reads (with locked file-backed folios) could use
> another bdev/meta inode to load some other I/Os (e.g. inode extent
> metadata or caching compressed data), so the locking order will be:

Which kernels need fixing? Do we expect that any code paths in 6.10 or
earlier are vulnerable to this?
On Sun, Jul 28, 2024 at 12:50:05PM -0700, Andrew Morton wrote:
> On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > Currently, migrate_pages_batch() can lock multiple locked folios
> > with an arbitrary order. Although folio_trylock() is used to avoid
> > deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
> > asynchronously firstly") mentioned, it seems try_split_folio() is
> > still missing.
>
> Am I correct in believing that folio_lock() doesn't have lockdep coverage?

Yes. It can't; it is taken in process context and released by whatever
context the read completion happens in (could be hard/soft irq, could be
a workqueue, could be J. Random kthread, depending on the device driver).
So it doesn't match the lockdep model at all.

> > It was found by compaction stress test when I explicitly enable EROFS
> > compressed files to use large folios, which case I cannot reproduce with
> > the same workload if large folio support is off (current mainline).
> > Typically, filesystem reads (with locked file-backed folios) could use
> > another bdev/meta inode to load some other I/Os (e.g. inode extent
> > metadata or caching compressed data), so the locking order will be:
>
> Which kernels need fixing. Do we expect that any code paths in 6.10 or
> earlier are vulnerable to this?

I would suggest it goes back to the introduction of large folios, but
that's just a gut feeling based on absolutely no reading of code or
inspection of git history.
On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
> It was found by compaction stress test when I explicitly enable EROFS
> compressed files to use large folios, which case I cannot reproduce with
> the same workload if large folio support is off (current mainline).
> Typically, filesystem reads (with locked file-backed folios) could use
> another bdev/meta inode to load some other I/Os (e.g. inode extent
> metadata or caching compressed data), so the locking order will be:

Umm. That is a new constraint to me. We have two other places which
take the folio lock in a particular order. Writeback takes locks on
folios belonging to the same inode in ascending ->index order. It
submits all the folios for write before moving on to lock other inodes,
so it does not conflict with this new constraint you're proposing.

The other place is remap_file_range(). Both inodes in that case must be
regular files,

	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
		return -EINVAL;

so this new rule is fine.

Does anybody know of any _other_ ordering constraints on folio locks? I'm
willing to write them down ...

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a912e4b83228 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  {
>  	int rc;
>  
> -	folio_lock(folio);
> +	if (!folio_trylock(folio))
> +		return -EAGAIN;
>  	rc = split_folio_to_list(folio, split_folios);
>  	folio_unlock(folio);
>  	if (!rc)

This feels like the best quick fix to me since migration is going to
walk the folios in a different order from writeback. I'm surprised
this hasn't already bitten us, to be honest.

(ie I don't think this is even necessarily connected to the new
ordering constraint; I think migration and writeback can already
deadlock)
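[Editor's note: the trylock-and-back-off pattern discussed above can be illustrated with a small userspace analogue. This is a hedged sketch in Python threading, not kernel code; `folio_trylock()` maps onto `Lock.acquire(blocking=False)`, and "returning -EAGAIN" onto simply recording a back-off instead of sleeping on the lock.]

```python
import threading

lock_a = threading.Lock()  # stands in for a file-backed folio lock (A)
lock_b = threading.Lock()  # stands in for a bdev/meta folio lock (B)
results = []

def worker(first, second, name):
    # Take the first lock unconditionally, then only *try* the second,
    # backing off (analogous to returning -EAGAIN) instead of blocking:
    # this breaks the ABBA cycle that two blocking acquires would form.
    with first:
        if second.acquire(blocking=False):
            results.append((name, "locked-both"))
            second.release()
        else:
            results.append((name, "eagain"))

t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)

# With blocking acquires on both locks, t1/t2 could deadlock (ABBA);
# with a trylock on the second lock, both threads always finish.
assert len(results) == 2
```

The kernel fix below is exactly this shape: `try_split_folio()` may be called while other folios in the batch are already locked, so it must not sleep on yet another folio lock.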
Hi,

On 2024/7/29 05:46, Matthew Wilcox wrote:
> On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
>> It was found by compaction stress test when I explicitly enable EROFS
>> compressed files to use large folios, which case I cannot reproduce with
>> the same workload if large folio support is off (current mainline).
>> Typically, filesystem reads (with locked file-backed folios) could use
>> another bdev/meta inode to load some other I/Os (e.g. inode extent
>> metadata or caching compressed data), so the locking order will be:
>
> Umm. That is a new constraint to me. We have two other places which
> take the folio lock in a particular order. Writeback takes locks on
> folios belonging to the same inode in ascending ->index order. It
> submits all the folios for write before moving on to lock other inodes,
> so it does not conflict with this new constraint you're proposing.

BTW, I don't believe this ordering is unique to EROFS: ext4 or ext2,
for example, will also use sb_bread() (buffer heads on the bdev inode)
to trigger some meta I/Os. Take ext2 for simplicity:

 ext2_readahead
   mpage_readahead
     ext2_get_block
       ext2_get_blocks
         ext2_get_branch
           sb_bread  <-- get some metadata used for this data I/O

> The other place is remap_file_range(). Both inodes in that case must be
> regular files,
> 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> 		return -EINVAL;
> so this new rule is fine.
>
> Does anybody know of any _other_ ordering constraints on folio locks? I'm
> willing to write them down ...

Personally, I can't think of any particular order between two folio
locks across different inodes, so I think batched folio locking always
needs to be handled with care.

>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 20cb9f5f7446..a912e4b83228 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>  {
>>  	int rc;
>>  
>> -	folio_lock(folio);
>> +	if (!folio_trylock(folio))
>> +		return -EAGAIN;
>>  	rc = split_folio_to_list(folio, split_folios);
>>  	folio_unlock(folio);
>>  	if (!rc)
>
> This feels like the best quick fix to me since migration is going to
> walk the folios in a different order from writeback. I'm surprised
> this hasn't already bitten us, to be honest.

My stress workload explicitly triggers compaction together with other
EROFS read loads; I'm not sure if others test like this too, but
https://lore.kernel.org/r/20240418001356.95857-1-mcgrof@kernel.org
seems like a similar load.

Thanks,
Gao Xiang

> (ie I don't think this is even necessarily connected to the new
> ordering constraint; I think migration and writeback can already
> deadlock)
Hi,

On 2024/7/29 05:17, Matthew Wilcox wrote:
> On Sun, Jul 28, 2024 at 12:50:05PM -0700, Andrew Morton wrote:
>> On Sun, 28 Jul 2024 23:49:13 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>> Currently, migrate_pages_batch() can lock multiple locked folios
>>> with an arbitrary order. Although folio_trylock() is used to avoid
>>> deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
>>> asynchronously firstly") mentioned, it seems try_split_folio() is
>>> still missing.
>>
>> Am I correct in believing that folio_lock() doesn't have lockdep coverage?
>
> Yes. It can't; it is taken in process context and released by whatever
> context the read completion happens in (could be hard/soft irq, could be
> a workqueue, could be J. Random kthread, depending on the device driver)
> So it doesn't match the lockdep model at all.
>
>>> It was found by compaction stress test when I explicitly enable EROFS
>>> compressed files to use large folios, which case I cannot reproduce with
>>> the same workload if large folio support is off (current mainline).
>>> Typically, filesystem reads (with locked file-backed folios) could use
>>> another bdev/meta inode to load some other I/Os (e.g. inode extent
>>> metadata or caching compressed data), so the locking order will be:
>>
>> Which kernels need fixing. Do we expect that any code paths in 6.10 or
>> earlier are vulnerable to this?
>
> I would suggest it goes back to the introduction of large folios, but
> that's just a gut feeling based on absolutely no reading of code or
> inspection of git history.

According to 5dfab109d519 ("migrate_pages: batch _unmap and _move"),
I think it's v6.3+. Yet I don't have more time to look into all the
history of batched migration; hoping Huang, Ying could give more
hints on this.

Thanks,
Gao Xiang
Hi, Xiang,

Gao Xiang <hsiangkao@linux.alibaba.com> writes:

> Currently, migrate_pages_batch() can lock multiple locked folios
> with an arbitrary order. Although folio_trylock() is used to avoid
> deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
> asynchronously firstly") mentioned, it seems try_split_folio() is
> still missing.
>
> It was found by compaction stress test when I explicitly enable EROFS
> compressed files to use large folios, which case I cannot reproduce with
> the same workload if large folio support is off (current mainline).
> Typically, filesystem reads (with locked file-backed folios) could use
> another bdev/meta inode to load some other I/Os (e.g. inode extent
> metadata or caching compressed data), so the locking order will be:
>
>   file-backed folios (A)
>     bdev/meta folios (B)
>
> The following calltrace shows the deadlock:
>   Thread 1 takes (B) lock and tries to take folio (A) lock
>   Thread 2 takes (A) lock and tries to take folio (B) lock
>
> [Thread 1]
> INFO: task stress:1824 blocked for more than 30 seconds.
>       Tainted: G           OE      6.10.0-rc7+ #6
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:stress  state:D  stack:0  pid:1824  tgid:1824  ppid:1822  flags:0x0000000c
> Call trace:
>  __switch_to+0xec/0x138
>  __schedule+0x43c/0xcb0
>  schedule+0x54/0x198
>  io_schedule+0x44/0x70
>  folio_wait_bit_common+0x184/0x3f8
>    <-- folio mapping ffff00036d69cb18 index 996 (**)
>  __folio_lock+0x24/0x38
>  migrate_pages_batch+0x77c/0xea0  // try_split_folio (mm/migrate.c:1486:2)
>                                   // migrate_pages_batch (mm/migrate.c:1734:16)
>    <--- LIST_HEAD(unmap_folios) has
>      ..
>      folio mapping 0xffff0000d184f1d8 index 1711; (*)
>      folio mapping 0xffff0000d184f1d8 index 1712;
>      ..
>  migrate_pages+0xb28/0xe90
>  compact_zone+0xa08/0x10f0
>  compact_node+0x9c/0x180
>  sysctl_compaction_handler+0x8c/0x118
>  proc_sys_call_handler+0x1a8/0x280
>  proc_sys_write+0x1c/0x30
>  vfs_write+0x240/0x380
>  ksys_write+0x78/0x118
>  __arm64_sys_write+0x24/0x38
>  invoke_syscall+0x78/0x108
>  el0_svc_common.constprop.0+0x48/0xf0
>  do_el0_svc+0x24/0x38
>  el0_svc+0x3c/0x148
>  el0t_64_sync_handler+0x100/0x130
>  el0t_64_sync+0x190/0x198
>
> [Thread 2]
> INFO: task stress:1825 blocked for more than 30 seconds.
>       Tainted: G           OE      6.10.0-rc7+ #6
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:stress  state:D  stack:0  pid:1825  tgid:1825  ppid:1822  flags:0x0000000c
> Call trace:
>  __switch_to+0xec/0x138
>  __schedule+0x43c/0xcb0
>  schedule+0x54/0x198
>  io_schedule+0x44/0x70
>  folio_wait_bit_common+0x184/0x3f8
>    <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
>  __folio_lock+0x24/0x38
>  z_erofs_runqueue+0x384/0x9c0 [erofs]
>  z_erofs_readahead+0x21c/0x350 [erofs]
>    <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
>  read_pages+0x74/0x328
>  page_cache_ra_order+0x26c/0x348
>  ondemand_readahead+0x1c0/0x3a0
>  page_cache_sync_ra+0x9c/0xc0
>  filemap_get_pages+0xc4/0x708
>  filemap_read+0x104/0x3a8
>  generic_file_read_iter+0x4c/0x150
>  vfs_read+0x27c/0x330
>  ksys_pread64+0x84/0xd0
>  __arm64_sys_pread64+0x28/0x40
>  invoke_syscall+0x78/0x108
>  el0_svc_common.constprop.0+0x48/0xf0
>  do_el0_svc+0x24/0x38
>  el0_svc+0x3c/0x148
>  el0t_64_sync_handler+0x100/0x130
>  el0t_64_sync+0x190/0x198
>
> Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20cb9f5f7446..a912e4b83228 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  {
>  	int rc;
>  
> -	folio_lock(folio);
> +	if (!folio_trylock(folio))
> +		return -EAGAIN;
>  	rc = split_folio_to_list(folio, split_folios);
>  	folio_unlock(folio);
>  	if (!rc)

Good catch! Thanks for the fix!

The deadlock is similar to the one we fixed in commit fb3592c41a44
("migrate_pages: fix deadlock in batched migration"). But apparently,
we missed this case.

For the fix, I think that we should still respect migrate_mode because
users may prefer migration success over blocking.

@@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 	return rc;
 }
 
-static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
+static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
+				  enum migrate_mode mode)
 {
 	int rc;
 
-	folio_lock(folio);
+	if (mode == MIGRATE_ASYNC) {
+		if (!folio_trylock(folio))
+			return -EAGAIN;
+	} else {
+		folio_lock(folio);
+	}
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
 	if (!rc)

--
Best Regards,
Huang, Ying
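[Editor's note: the mode-dependent locking Huang, Ying proposes above can be sketched in a userspace analogue. This is an illustrative Python model, not the kernel implementation: `MIGRATE_ASYNC`/`MIGRATE_SYNC` are stand-ins for the kernel's `enum migrate_mode`, and the `split` callable stands in for `split_folio_to_list()`.]

```python
import threading

MIGRATE_ASYNC, MIGRATE_SYNC = "async", "sync"
EAGAIN = -11  # models the kernel's -EAGAIN return

def try_split_folio(folio_lock, mode, split):
    """Async callers bail out with EAGAIN rather than sleep on the
    lock (avoiding ABBA deadlock within a locked batch); sync callers
    may block, preferring migration success over responsiveness."""
    if mode == MIGRATE_ASYNC:
        if not folio_lock.acquire(blocking=False):
            return EAGAIN
    else:
        folio_lock.acquire()
    try:
        return split()
    finally:
        folio_lock.release()

lock = threading.Lock()
# Uncontended: the async trylock succeeds and the split runs.
assert try_split_folio(lock, MIGRATE_ASYNC, lambda: 0) == 0

lock.acquire()  # simulate another holder of the folio lock
# Contended + async: back off immediately instead of deadlocking.
assert try_split_folio(lock, MIGRATE_ASYNC, lambda: 0) == EAGAIN
lock.release()
```

The design point mirrors commit 2ef7dbb26990's approach: batched migration first tries asynchronously with trylocks, then falls back to a synchronous pass where blocking is acceptable because only one folio is held at a time.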
Hi Ying,

On 2024/7/29 09:38, Huang, Ying wrote:
> Hi, Xiang,
>
> Gao Xiang <hsiangkao@linux.alibaba.com> writes:
>
>> Currently, migrate_pages_batch() can lock multiple locked folios
>> with an arbitrary order. Although folio_trylock() is used to avoid
>> deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
>> asynchronously firstly") mentioned, it seems try_split_folio() is
>> still missing.

[...]

>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 20cb9f5f7446..a912e4b83228 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>  {
>>  	int rc;
>>  
>> -	folio_lock(folio);
>> +	if (!folio_trylock(folio))
>> +		return -EAGAIN;
>>  	rc = split_folio_to_list(folio, split_folios);
>>  	folio_unlock(folio);
>>  	if (!rc)
>
> Good catch! Thanks for the fixing!
>
> The deadlock is similar as the one we fixed in commit fb3592c41a44
> ("migrate_pages: fix deadlock in batched migration"). But apparently,
> we missed this case.
>
> For the fix, I think that we should still respect migrate_mode because
> users may prefer migration success over blocking.
>
> @@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>  	return rc;
>  }
>  
> -static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
> +static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
> +				  enum migrate_mode mode)
> {
>  	int rc;
>  
> -	folio_lock(folio);
> +	if (mode == MIGRATE_ASYNC) {
> +		if (!folio_trylock(folio))
> +			return -EAGAIN;
> +	} else {
> +		folio_lock(folio);
> +	}
>  	rc = split_folio_to_list(folio, split_folios);
>  	folio_unlock(folio);
>  	if (!rc)

Okay, yeah, it looks better since it seems I missed the fallback
part in migrate_pages_sync().

Let me send the next version following your advice, thanks.

Thanks,
Gao Xiang

> --
> Best Regards,
> Huang, Ying
Hi Matthew,

On 2024/7/29 06:11, Gao Xiang wrote:
> Hi,
>
> On 2024/7/29 05:46, Matthew Wilcox wrote:
>> On Sun, Jul 28, 2024 at 11:49:13PM +0800, Gao Xiang wrote:
>>> It was found by compaction stress test when I explicitly enable EROFS
>>> compressed files to use large folios, which case I cannot reproduce with
>>> the same workload if large folio support is off (current mainline).
>>> Typically, filesystem reads (with locked file-backed folios) could use
>>> another bdev/meta inode to load some other I/Os (e.g. inode extent
>>> metadata or caching compressed data), so the locking order will be:
>>
>> Umm. That is a new constraint to me. We have two other places which
>> take the folio lock in a particular order. Writeback takes locks on
>> folios belonging to the same inode in ascending ->index order. It
>> submits all the folios for write before moving on to lock other inodes,
>> so it does not conflict with this new constraint you're proposing.
>
> BTW, I don't believe it's a new order out of EROFS, if you consider
> ext4 or ext2 for example, it will also use sb_bread() (buffer heads
> on bdev inode to trigger some meta I/Os),
>
> e.g. take ext2 for simplicity:
>  ext2_readahead
>    mpage_readahead
>      ext2_get_block
>        ext2_get_blocks
>          ext2_get_branch
>            sb_bread  <-- get some metadata using for this data I/O

I guess I need to write more words about this: although sb_bread()
currently mainly takes buffer locks to do meta I/Os, the following
path takes a similar dependency:

 ...
  sb_bread
    __bread_gfp
      bdev_getblk
        __getblk_slow
          grow_dev_folio    // bdev->bd_mapping
            __filemap_get_folio(FGP_LOCK | .. | FGP_CREAT)

So the ordering has already existed for decades. Although EROFS hasn't
used buffer heads since its initial version, it needs a different
address_space to cache metadata in the page cache for best performance.

In the .read_folio() and .readahead() contexts, the order has to be

  file-backed folios
    bdev/meta folios

since it's hard to use any other order: the file-backed folios cannot
be filled without uptodate bdev/meta folios.

>> The other place is remap_file_range(). Both inodes in that case must be
>> regular files,
>> 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>> 		return -EINVAL;
>> so this new rule is fine.

Referring to vfs_dedupe_file_range_compare() and vfs_lock_two_folios(),
it seems they only consider folio->index regardless of the
address_spaces, too.

>> Does anybody know of any _other_ ordering constraints on folio locks? I'm
>> willing to write them down ...
>
> Personally I don't think out any particular order between two folio
> locks acrossing different inodes, so I think folio batching locking
> always needs to be taken care.

I think the folio_lock() comment on different address_spaces, added in
commit cd125eeab2de ("filemap: Update the folio_lock documentation"),
should be refined:

 ...
 * in the same address_space. If they are in different address_spaces,
 * acquire the lock of the folio which belongs to the address_space which
 * has the lowest address in memory first.
 */
 static inline void folio_lock(struct folio *folio)
 {
 ...

since there are several cases where we cannot follow that comment due to
the .read_folio(), .readahead() and other contexts. I'm not sure how to
document the ordering between different address_spaces, so I think it's
just "no particular order between different address_spaces".

Thanks,
Gao Xiang
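[Editor's note: the "lowest address in memory first" rule quoted from the folio_lock() comment is the classic ordered-locking discipline. A userspace sketch of that discipline, in Python and purely illustrative (ordering by `id()` stands in for ordering by folio address):]

```python
import threading

def lock_two(obj1, obj2):
    """Acquire two locks in a single global order (here, by id()),
    mirroring the 'lowest address first' rule: if every caller follows
    the same order, no ABBA deadlock is possible."""
    first, second = sorted((obj1, obj2), key=id)
    first.acquire()
    second.acquire()
    return first, second

a, b = threading.Lock(), threading.Lock()
done = []

def racer(x, y):
    f, s = lock_two(x, y)   # callers may pass the pair in either order
    done.append(True)
    s.release()
    f.release()

# Half the racers pass (a, b), half pass (b, a); ordered locking makes
# the argument order irrelevant.
threads = [threading.Thread(target=racer,
                            args=(a, b) if i % 2 else (b, a))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)

assert len(done) == 8  # every racer completed; no deadlock
```

The point of the discussion above is precisely that .read_folio()/.readahead() contexts *cannot* follow such a global order (the file-backed folio is locked by the caller before the filesystem even knows which meta folio it needs), which is why trylock-based back-off is the workable alternative there.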
On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> > For the fix, I think that we should still respect migrate_mode because
> > users may prefer migration success over blocking.
> >
> > @@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
> >  	return rc;
> >  }
> >  
> > -static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
> > +static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
> > +				  enum migrate_mode mode)
> > {
> >  	int rc;
> >  
> > -	folio_lock(folio);
> > +	if (mode == MIGRATE_ASYNC) {
> > +		if (!folio_trylock(folio))
> > +			return -EAGAIN;
> > +	} else {
> > +		folio_lock(folio);
> > +	}
> >  	rc = split_folio_to_list(folio, split_folios);
> >  	folio_unlock(folio);
> >  	if (!rc)
>
> Okay, yeah it looks better since it seems I missed the fallback
> part in migrate_pages_sync().
>
> Let me send the next version to follow your advice, thanks.

The author seems to have disappeared. Should we merge this as-is or
does someone want to take a look at developing a v2?

Thanks.
Hi Andrew,

On 2024/8/16 13:02, Andrew Morton wrote:
> On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>>> For the fix, I think that we should still respect migrate_mode because
>>> users may prefer migration success over blocking.
>>>
>>> @@ -1492,11 +1492,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>>>  	return rc;
>>>  }
>>>  
>>> -static inline int try_split_folio(struct folio *folio, struct list_head *split_folios)
>>> +static inline int try_split_folio(struct folio *folio, struct list_head *split_folios,
>>> +				  enum migrate_mode mode)
>>> {
>>>  	int rc;
>>>  
>>> -	folio_lock(folio);
>>> +	if (mode == MIGRATE_ASYNC) {
>>> +		if (!folio_trylock(folio))
>>> +			return -EAGAIN;
>>> +	} else {
>>> +		folio_lock(folio);
>>> +	}
>>>  	rc = split_folio_to_list(folio, split_folios);
>>>  	folio_unlock(folio);
>>>  	if (!rc)
>>
>> Okay, yeah it looks better since it seems I missed the fallback
>> part in migrate_pages_sync().
>>
>> Let me send the next version to follow your advice, thanks.
>
> The author seems to have disappeared. Should we merge this as-is or
> does someone want to take a look at developing a v2?

I replied to your email last week; I'm not sure why it has not
been addressed:

https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/

The patch in your queue is already v2, no?

Thanks,
Gao Xiang
On 2024/8/16 13:12, Gao Xiang wrote:
>
> Hi Andrew,
>
> On 2024/8/16 13:02, Andrew Morton wrote:
>> On Mon, 29 Jul 2024 09:58:02 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
[...]
>> The author seems to have disappeared. Should we merge this as-is or
>> does someone want to take a look at developing a v2?
>
> I've replied your email last week, I'm not sure why it has not
> been addressed?
>
> https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/
>
> The patch in your queue is already v2? No?

Really confused about this, since the comment above was about v1,
and v2 was already sent (on July 29) and has been in -next for two
weeks with:

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>

What else do I need to do to resolve this already-resolved comment
so that I can enable large folios without deadlocks anymore?

Thanks,
Gao Xiang

> Thanks,
> Gao Xiang
On Fri, 16 Aug 2024 13:17:04 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> >> does someone want to take a look at developing a v2?
> >
> > I've replied your email last week, I'm not sure why it has not
> > been addressed?
> >
> > https://lore.kernel.org/linux-mm/20240729021306.398286-1-hsiangkao@linux.alibaba.com/
> >
> > The patch in your queue is already v2? No?
>
> Really confused about this, since the comment above was about v1.
> and v2 is already sent (in July 29) and in -next for two weeks
> with
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> What else I need to do to resolve this already resolved comment
> so that I could enable large folios without deadlocks anymore?

Is cool, thanks - I failed to correctly update my notes-to-self.
On 2024/8/16 13:25, Andrew Morton wrote:
> On Fri, 16 Aug 2024 13:17:04 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
[...]
>> What else I need to do to resolve this already resolved comment
>> so that I could enable large folios without deadlocks anymore?
>
> Is cool, thanks - I failed to correctly update my notes-to-self.

Many thanks for the confirmation; hopefully it lands upstream so I
can enable large folios safely.

Thanks,
Gao Xiang
diff --git a/mm/migrate.c b/mm/migrate.c
index 20cb9f5f7446..a912e4b83228 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1483,7 +1483,8 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 {
 	int rc;
 
-	folio_lock(folio);
+	if (!folio_trylock(folio))
+		return -EAGAIN;
 	rc = split_folio_to_list(folio, split_folios);
 	folio_unlock(folio);
 	if (!rc)
Currently, migrate_pages_batch() can lock multiple locked folios
with an arbitrary order. Although folio_trylock() is used to avoid
deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch
asynchronously firstly") mentioned, it seems try_split_folio() is
still missing.

It was found by compaction stress test when I explicitly enable EROFS
compressed files to use large folios, which case I cannot reproduce with
the same workload if large folio support is off (current mainline).
Typically, filesystem reads (with locked file-backed folios) could use
another bdev/meta inode to load some other I/Os (e.g. inode extent
metadata or caching compressed data), so the locking order will be:

  file-backed folios (A)
    bdev/meta folios (B)

The following calltrace shows the deadlock:
  Thread 1 takes (B) lock and tries to take folio (A) lock
  Thread 2 takes (A) lock and tries to take folio (B) lock

[Thread 1]
INFO: task stress:1824 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress  state:D  stack:0  pid:1824  tgid:1824  ppid:1822  flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
   <-- folio mapping ffff00036d69cb18 index 996 (**)
 __folio_lock+0x24/0x38
 migrate_pages_batch+0x77c/0xea0  // try_split_folio (mm/migrate.c:1486:2)
                                  // migrate_pages_batch (mm/migrate.c:1734:16)
   <--- LIST_HEAD(unmap_folios) has
     ..
     folio mapping 0xffff0000d184f1d8 index 1711; (*)
     folio mapping 0xffff0000d184f1d8 index 1712;
     ..
 migrate_pages+0xb28/0xe90
 compact_zone+0xa08/0x10f0
 compact_node+0x9c/0x180
 sysctl_compaction_handler+0x8c/0x118
 proc_sys_call_handler+0x1a8/0x280
 proc_sys_write+0x1c/0x30
 vfs_write+0x240/0x380
 ksys_write+0x78/0x118
 __arm64_sys_write+0x24/0x38
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

[Thread 2]
INFO: task stress:1825 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress  state:D  stack:0  pid:1825  tgid:1825  ppid:1822  flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
   <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
 __folio_lock+0x24/0x38
 z_erofs_runqueue+0x384/0x9c0 [erofs]
 z_erofs_readahead+0x21c/0x350 [erofs]
   <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
 read_pages+0x74/0x328
 page_cache_ra_order+0x26c/0x348
 ondemand_readahead+0x1c0/0x3a0
 page_cache_sync_ra+0x9c/0xc0
 filemap_get_pages+0xc4/0x708
 filemap_read+0x104/0x3a8
 generic_file_read_iter+0x4c/0x150
 vfs_read+0x27c/0x330
 ksys_pread64+0x84/0xd0
 __arm64_sys_pread64+0x28/0x40
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 mm/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)