Message ID | 04997e54c276eff40a6119a90d36a4e71aade89c.1735806921.git.baolin.wang@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm: shmem: skip swapcache for swapin of synchronous swap device | expand |
On Thu, Jan 02, 2025 at 04:40:17PM +0800, Baolin Wang wrote: > With fast swap devices (such as zram), swapin latency is crucial to applications. > For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache > operation to improve swapin latency. OK, but now we have more complexity. Why can't we always skip the swapcache on swapin? (Actually, I think we can always skip the swapcache on swapout too, but that's a different matter). > +static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma, > + pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) Please wrap at 80 columns and use two tabs for indenting subsequent lines. ie: static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, swp_entry_t entry, int order, gfp_t gfp)
On 2025/1/2 21:10, Matthew Wilcox wrote: > On Thu, Jan 02, 2025 at 04:40:17PM +0800, Baolin Wang wrote: >> With fast swap devices (such as zram), swapin latency is crucial to applications. >> For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache >> operation to improve swapin latency. > > OK, but now we have more complexity. Why can't we always skip the > swapcache on swapin? Skipping swapcache is used to swap-in shmem large folios, avoiding the large folios being split. Meanwhile, since the IO latency of syncing swap devices is relatively small, it won't cause the IO latency amplification issue. But for async swap devices, if we swap-in the large folio one-time, I am afraid the IO latency can be amplified. And I remember we still haven't reached an agreement here[1], so let's step by step and start with the sync swap devices first. [1] https://lore.kernel.org/linux-mm/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/ (Actually, I think we can always skip the > swapcache on swapout too, but that's a different matter). Good suggestion. I will have a detail look. >> +static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma, >> + pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) > > Please wrap at 80 columns and use two tabs for indenting subsequent > lines. ie: > > static struct folio *shmem_swap_alloc_folio(struct inode *inode, > struct vm_area_struct *vma, pgoff_t index, swp_entry_t entry, > int order, gfp_t gfp) Sure. Thanks.
On Mon, Jan 06, 2025 at 11:46:04AM +0800, Baolin Wang wrote: > On 2025/1/2 21:10, Matthew Wilcox wrote: > > On Thu, Jan 02, 2025 at 04:40:17PM +0800, Baolin Wang wrote: > > > With fast swap devices (such as zram), swapin latency is crucial to applications. > > > For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache > > > operation to improve swapin latency. > > > > OK, but now we have more complexity. Why can't we always skip the > > swapcache on swapin? > > Skipping swapcache is used to swap-in shmem large folios, avoiding the large > folios being split. Meanwhile, since the IO latency of syncing swap devices > is relatively small, it won't cause the IO latency amplification issue. > > But for async swap devices, if we swap-in the large folio one-time, I am > afraid the IO latency can be amplified. And I remember we still haven't > reached an agreement here[1], so let's step by step and start with the sync > swap devices first. Regardless of whether we choose to swap-in an order-0 or a large folio, my point is that we should always do it to the pagecache rather than the swap cache.
On 2025/1/6 12:07, Matthew Wilcox wrote: > On Mon, Jan 06, 2025 at 11:46:04AM +0800, Baolin Wang wrote: >> On 2025/1/2 21:10, Matthew Wilcox wrote: >>> On Thu, Jan 02, 2025 at 04:40:17PM +0800, Baolin Wang wrote: >>>> With fast swap devices (such as zram), swapin latency is crucial to applications. >>>> For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache >>>> operation to improve swapin latency. >>> >>> OK, but now we have more complexity. Why can't we always skip the >>> swapcache on swapin? >> >> Skipping swapcache is used to swap-in shmem large folios, avoiding the large >> folios being split. Meanwhile, since the IO latency of syncing swap devices >> is relatively small, it won't cause the IO latency amplification issue. >> >> But for async swap devices, if we swap-in the large folio one-time, I am >> afraid the IO latency can be amplified. And I remember we still haven't >> reached an agreement here[1], so let's step by step and start with the sync >> swap devices first. > > Regardless of whether we choose to swap-in an order-0 or a large folio, > my point is that we should always do it to the pagecache rather than the > swap cache. IMO, this would miss the swap readahead algorithm in the swap case, which can benefit the order-0 swap-in. We need more work to ensure that skipping swapcache is helpful for all cases, which is why I'm starting with sync swap devices first.
On 2025/1/6 12:59, Baolin Wang wrote: > > > On 2025/1/6 12:07, Matthew Wilcox wrote: >> On Mon, Jan 06, 2025 at 11:46:04AM +0800, Baolin Wang wrote: >>> On 2025/1/2 21:10, Matthew Wilcox wrote: >>>> On Thu, Jan 02, 2025 at 04:40:17PM +0800, Baolin Wang wrote: >>>>> With fast swap devices (such as zram), swapin latency is crucial to >>>>> applications. >>>>> For shmem swapin, similar to anonymous memory swapin, we can skip >>>>> the swapcache >>>>> operation to improve swapin latency. >>>> >>>> OK, but now we have more complexity. Why can't we always skip the >>>> swapcache on swapin? >>> >>> Skipping swapcache is used to swap-in shmem large folios, avoiding >>> the large >>> folios being split. Meanwhile, since the IO latency of syncing swap >>> devices >>> is relatively small, it won't cause the IO latency amplification issue. >>> >>> But for async swap devices, if we swap-in the large folio one-time, I am >>> afraid the IO latency can be amplified. And I remember we still haven't >>> reached an agreement here[1], so let's step by step and start with >>> the sync >>> swap devices first. >> >> Regardless of whether we choose to swap-in an order-0 or a large folio, >> my point is that we should always do it to the pagecache rather than the >> swap cache. > > IMO, this would miss the swap readahead algorithm in the swap case, > which can benefit the order-0 swap-in. We need more work to ensure that > skipping swapcache is helpful for all cases, which is why I'm starting > with sync swap devices first. BTW, I used the SSD swap device to test the performance of skipping swapcache with the following hack changes, and I found that the performance of order-0 sequential swap-in shows a significant regression. Without the following changes: 1G order-0 shmem swap-in: 8056 ms With the following changes (skip swapcache): 1G order-0 shmem swap-in: 38536 ms diff --git a/mm/page_io.c b/mm/page_io.c index 9b983de351f9..1e22dedcd584 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -620,7 +620,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) unsigned long pflags; bool in_thrashing; - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); diff --git a/mm/shmem.c b/mm/shmem.c index e82ef1ef1c68..2902d3477520 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2295,7 +2295,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, fallback_order0 = true; /* Skip swapcache for synchronous device. */ - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) { + if (!fallback_order0) { folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp); if (!IS_ERR(folio)) { skip_swapcache = true;
diff --git a/mm/shmem.c b/mm/shmem.c index 95b80c24f6f9..e82ef1ef1c68 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1967,6 +1967,66 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, return ERR_PTR(error); } +static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma, + pgoff_t index, swp_entry_t entry, int order, gfp_t gfp) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + struct folio *new; + void *shadow; + int nr_pages; + + /* + * We have arrived here because our zones are constrained, so don't + * limit chance of success by further cpuset and node constraints. + */ + gfp &= ~GFP_CONSTRAINT_MASK; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (order > 0) { + gfp_t huge_gfp = vma_thp_gfp_mask(vma); + + gfp = limit_gfp_mask(huge_gfp, gfp); + } +#endif + + new = shmem_alloc_folio(gfp, order, info, index); + if (!new) + return ERR_PTR(-ENOMEM); + + nr_pages = folio_nr_pages(new); + if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL, + gfp, entry)) { + folio_put(new); + return ERR_PTR(-ENOMEM); + } + + /* + * Prevent parallel swapin from proceeding with the swap cache flag. + * + * Of course there is another possible concurrent scenario as well, + * that is to say, the swap cache flag of a large folio has already + * been set by swapcache_prepare(), while another thread may have + * already split the large swap entry stored in the shmem mapping. + * In this case, shmem_add_to_page_cache() will help identify the + * concurrent swapin and return -EEXIST. + */ + if (swapcache_prepare(entry, nr_pages)) { + folio_put(new); + return ERR_PTR(-EEXIST); + } + + __folio_set_locked(new); + __folio_set_swapbacked(new); + new->swap = entry; + + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); + shadow = get_shadow_from_swap_cache(entry); + if (shadow) + workingset_refault(new, shadow); + folio_add_lru(new); + swap_read_folio(new, NULL); + return new; +} + /* * When a page is moved from swapcache to shmem filecache (either by the * usual swapin of shmem_get_folio_gfp(), or by the less common swapoff of @@ -2070,7 +2130,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, } static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, - struct folio *folio, swp_entry_t swap) + struct folio *folio, swp_entry_t swap, + bool skip_swapcache) { struct address_space *mapping = inode->i_mapping; swp_entry_t swapin_error; @@ -2086,7 +2147,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, nr_pages = folio_nr_pages(folio); folio_wait_writeback(folio); - delete_from_swap_cache(folio); + if (!skip_swapcache) + delete_from_swap_cache(folio); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) @@ -2190,6 +2252,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct shmem_inode_info *info = SHMEM_I(inode); struct swap_info_struct *si; struct folio *folio = NULL; + bool skip_swapcache = false; swp_entry_t swap; int error, nr_pages; @@ -2211,6 +2274,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, /* Look it up and read it in.. */ folio = swap_cache_get_folio(swap, NULL, 0); if (!folio) { + int order = xa_get_order(&mapping->i_pages, index); + bool fallback_order0 = false; int split_order; /* Or update major stats only when swapin succeeds?? */ @@ -2220,6 +2285,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, count_memcg_event_mm(fault_mm, PGMAJFAULT); } + /* + * If uffd is active for the vma, we need per-page fault + * fidelity to maintain the uffd semantics, then fallback + * to swapin order-0 folio, as well as for zswap case. + */ + if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) || + !zswap_never_enabled())) + fallback_order0 = true; + + /* Skip swapcache for synchronous device. */ + if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) { + folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp); + if (!IS_ERR(folio)) { + skip_swapcache = true; + goto alloced; + } + + /* + * Fallback to swapin order-0 folio unless the swap entry + * already exists. + */ + error = PTR_ERR(folio); + folio = NULL; + if (error == -EEXIST) + goto failed; + } + /* * Now swap device can only swap in order 0 folio, then we * should split the large swap entry stored in the pagecache @@ -2250,9 +2342,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, } } +alloced: /* We have to do this with folio locked to prevent races */ folio_lock(folio); - if (!folio_test_swapcache(folio) || + if ((!skip_swapcache && !folio_test_swapcache(folio)) || folio->swap.val != swap.val || !shmem_confirm_swap(mapping, index, swap)) { error = -EEXIST; @@ -2288,7 +2381,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (sgp == SGP_WRITE) folio_mark_accessed(folio); - delete_from_swap_cache(folio); + if (skip_swapcache) { + folio->swap.val = 0; + swapcache_clear(si, swap, nr_pages); + } else { + delete_from_swap_cache(folio); + } folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); put_swap_device(si); @@ -2299,8 +2397,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (!shmem_confirm_swap(mapping, index, swap)) error = -EEXIST; if (error == -EIO) - shmem_set_folio_swapin_error(inode, index, folio, swap); + shmem_set_folio_swapin_error(inode, index, folio, swap, + skip_swapcache); unlock: + if (skip_swapcache) + swapcache_clear(si, swap, folio_nr_pages(folio)); if (folio) { folio_unlock(folio); folio_put(folio);
With fast swap devices (such as zram), swapin latency is crucial to applications. For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache operation to improve swapin latency. Testing 1G shmem sequential swapin without THP enabled, I observed approximately a 6% performance improvement: (Note: I repeated 5 times and took the mean data for each test) w/o patch w/ patch changes 534.8ms 501ms +6.3% In addition, currently, we always split the large swap entry stored in the shmem mapping during shmem large folio swapin, which is not perfect, especially with a fast swap device. We should swap in the whole large folio instead of splitting the precious large folios to take advantage of the large folios and improve the swapin latency if the swap device is synchronous device, which is similar to anonymous memory mTHP swapin. Testing 1G shmem sequential swapin with 64K mTHP and 2M mTHP, I observed obvious performance improvement: mTHP=64K w/o patch w/ patch changes 550.4ms 169.6ms +69% mTHP=2M w/o patch w/ patch changes 542.8ms 126.8ms +77% Note that skipping swapcache requires attention to concurrent swapin scenarios. Fortunately the swapcache_prepare() and shmem_add_to_page_cache() can help identify concurrent swapin and large swap entry split scenarios, and return -EEXIST for retry. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> --- Changes from v1: - Skip calling delete_from_swap_cache() in shmem_set_folio_swapin_error() when skipping the swapcache. --- mm/shmem.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 106 insertions(+), 5 deletions(-)