Message ID | 20240821074541.516249-3-hanchuanhua@oppo.com (mailing list archive) |
---|---|
State | New |
Series | mm: Ignite large folios swap-in support |
On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote: > From: Chuanhua Han <hanchuanhua@oppo.com> > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > zsmalloc compression/decompression with larger granularity[2]. The upcoming > optimization in zsmalloc will significantly increase swap speed and improve > compression efficiency. Tested by running 100 iterations of swapping 100MiB > of anon memory, the swap speed improved dramatically: > time consumption of swapin(ms) time consumption of swapout(ms) > lz4 4k 45274 90540 > lz4 64k 22942 55667 > zstdn 4k 85035 186585 > zstdn 64k 46558 118533 Are the above numbers with the patch series at [2] or without? Also, can you explain your experiment setup, or how someone can reproduce these numbers? > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> > > > > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > > zsmalloc compression/decompression with larger granularity[2]. The upcoming > > optimization in zsmalloc will significantly increase swap speed and improve > > compression efficiency. Tested by running 100 iterations of swapping 100MiB > > of anon memory, the swap speed improved dramatically: > > time consumption of swapin(ms) time consumption of swapout(ms) > > lz4 4k 45274 90540 > > lz4 64k 22942 55667 > > zstdn 4k 85035 186585 > > zstdn 64k 46558 118533 > > Are the above number with the patch series at [2] or without? Also can > you explain your experiment setup or how can someone reproduce these? Hi Shakeel, The data was recorded after applying both this patch (swap-in mTHP) and patch [2] (compressing/decompressing mTHP instead of page). However, without the swap-in series, patch [2] becomes useless because: If we have a large object, such as 16 pages in zsmalloc: do_swap_page will happen 16 times: 1. decompress the whole large object and copy one page; 2. decompress the whole large object and copy one page; 3. decompress the whole large object and copy one page; .... 16. decompress the whole large object and copy one page; So, patchset [2] will actually degrade performance rather than enhance it if we don't have this swap-in series. This swap-in series is a prerequisite for the zsmalloc/zram series. We reproduced the data through the following simple steps: 1. Collected anonymous pages from a running phone and saved them to a file. 2. Used a small program to open and read the file into a mapped anonymous memory. 3. Do the belows in the small program: swapout_start_time madv_pageout() swapout_end_time swapin_start_time read_data() swapin_end_time We calculate the throughput of swapout and swapin using the difference between end_time and start_time. Additionally, we record the memory usage of zram after the swapout is complete. > > > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > Thanks Barry
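A back-of-the-envelope sketch of the read amplification Barry describes above (the 16-page object size and the assumption that decompression cost scales with the stored object size are illustrative, not measured):

/* Illustration only: decompression work to fault back one large zsmalloc
 * object of nr_pages pages, with and without mTHP swap-in. Assumes each
 * decompression costs roughly the size of the stored object. */
#include <stdio.h>

int main(void)
{
	int nr_pages = 16;			/* pages per large compressed object */
	int work_4k = nr_pages * nr_pages;	/* one fault per page, whole object decompressed each time */
	int work_mthp = nr_pages;		/* one fault, one decompression */

	printf("order-0 swap-in: %d pages of decompression (%dx amplification)\n",
	       work_4k, work_4k / work_mthp);
	printf("mTHP swap-in: %d pages of decompression\n", work_mthp);
	return 0;
}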
Hi Barry, On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote: > On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote: > > > From: Chuanhua Han <hanchuanhua@oppo.com> > > > > > > > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > > > zsmalloc compression/decompression with larger granularity[2]. The upcoming > > > optimization in zsmalloc will significantly increase swap speed and improve > > > compression efficiency. Tested by running 100 iterations of swapping 100MiB > > > of anon memory, the swap speed improved dramatically: > > > time consumption of swapin(ms) time consumption of swapout(ms) > > > lz4 4k 45274 90540 > > > lz4 64k 22942 55667 > > > zstdn 4k 85035 186585 > > > zstdn 64k 46558 118533 > > > > Are the above number with the patch series at [2] or without? Also can > > you explain your experiment setup or how can someone reproduce these? > > Hi Shakeel, > > The data was recorded after applying both this patch (swap-in mTHP) and > patch [2] (compressing/decompressing mTHP instead of page). However, > without the swap-in series, patch [2] becomes useless because: > > If we have a large object, such as 16 pages in zsmalloc: > do_swap_page will happen 16 times: > 1. decompress the whole large object and copy one page; > 2. decompress the whole large object and copy one page; > 3. decompress the whole large object and copy one page; > .... > 16. decompress the whole large object and copy one page; > > So, patchset [2] will actually degrade performance rather than > enhance it if we don't have this swap-in series. This swap-in > series is a prerequisite for the zsmalloc/zram series. Thanks for the explanation. > > We reproduced the data through the following simple steps: > 1. Collected anonymous pages from a running phone and saved them to a file. > 2. Used a small program to open and read the file into a mapped anonymous > memory. > 3. Do the belows in the small program: > swapout_start_time > madv_pageout() > swapout_end_time > > swapin_start_time > read_data() > swapin_end_time > > We calculate the throughput of swapout and swapin using the difference between > end_time and start_time. Additionally, we record the memory usage of zram after > the swapout is complete. > Please correct me if I am wrong but you are saying in your experiment, 100 MiB took 90540 ms to compress/swapout and 45274 ms to decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms if backed by 64k pages. Basically the table shows total time to compress or decomress 100 MiB of memory, right? > > > > > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > > > Thanks > Barry
On Sat, Aug 24, 2024 at 5:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > Hi Barry, > > On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote: > > On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@oppo.com wrote: > > > > From: Chuanhua Han <hanchuanhua@oppo.com> > > > > > > > > > > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > > > > zsmalloc compression/decompression with larger granularity[2]. The upcoming > > > > optimization in zsmalloc will significantly increase swap speed and improve > > > > compression efficiency. Tested by running 100 iterations of swapping 100MiB > > > > of anon memory, the swap speed improved dramatically: > > > > time consumption of swapin(ms) time consumption of swapout(ms) > > > > lz4 4k 45274 90540 > > > > lz4 64k 22942 55667 > > > > zstdn 4k 85035 186585 > > > > zstdn 64k 46558 118533 > > > > > > Are the above number with the patch series at [2] or without? Also can > > > you explain your experiment setup or how can someone reproduce these? > > > > Hi Shakeel, > > > > The data was recorded after applying both this patch (swap-in mTHP) and > > patch [2] (compressing/decompressing mTHP instead of page). However, > > without the swap-in series, patch [2] becomes useless because: > > > > If we have a large object, such as 16 pages in zsmalloc: > > do_swap_page will happen 16 times: > > 1. decompress the whole large object and copy one page; > > 2. decompress the whole large object and copy one page; > > 3. decompress the whole large object and copy one page; > > .... > > 16. decompress the whole large object and copy one page; > > > > So, patchset [2] will actually degrade performance rather than > > enhance it if we don't have this swap-in series. This swap-in > > series is a prerequisite for the zsmalloc/zram series. > > Thanks for the explanation. > > > > > We reproduced the data through the following simple steps: > > 1. Collected anonymous pages from a running phone and saved them to a file. > > 2. Used a small program to open and read the file into a mapped anonymous > > memory. > > 3. Do the belows in the small program: > > swapout_start_time > > madv_pageout() > > swapout_end_time > > > > swapin_start_time > > read_data() > > swapin_end_time > > > > We calculate the throughput of swapout and swapin using the difference between > > end_time and start_time. Additionally, we record the memory usage of zram after > > the swapout is complete. > > > > Please correct me if I am wrong but you are saying in your experiment, > 100 MiB took 90540 ms to compress/swapout and 45274 ms to > decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms > if backed by 64k pages. Basically the table shows total time to compress > or decomress 100 MiB of memory, right? Hi Shakeel, Tangquan(CC'd) collected the data and double-checked the case to confirm the answer to your question. We have three cases: 1. no mTHP swap-in, no zsmalloc/zram multi-pages compression/decompression 2. have mTHP swap-in, no zsmalloc/zram multi-pages compression/decompression 3. have mTHP swap-in, have zsmalloc/zram multi-pages compression/decompression The data was 1 vs 3. To provide more precise data that covers each change, Tangquan tested 1 vs. 2 and 2 vs. 3 yesterday using LZ4 (the hardware might differ from the previous test, but the data shows the same trend) per my request. 1. no mTHP swapin, no zsmalloc/zram patch swapin_ms. 30336 swapout_ms. 
65651 2. have mTHP swapin, no zsmalloc/zram patch swapin_ms. 27161 swapout_ms. 61135 3. have mTHP swapin, have zsmalloc/zram patch swapin_ms. 13683 swapout_ms. 43305 The test pseudocode is as follows: addr=mmap(100M) read_anon_data_from_file_to addr(); for(i=0;i<100;i++) { swapout_start_time; madv_pageout(); swapout_end_time; swapin_start_time; read_addr_to_swapin(); swapin_end_time; } So, while we saw some improvement from 1 to 2, the significant gains come from using large blocks for compression and decompression. This mTHP swap-in series ensures that mTHPs aren't lost after the first swap-in, so the following 99 iterations continue to involve THP swap-out and mTHP swap-in. The improvement from 1 to 2 is due to this mTHP swap-in series, while the improvement from 2 to 3 comes from the zsmalloc/zram patchset [2] you mentioned. [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > > > > Thanks Barry
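A minimal user-space sketch of the test loop described above (a reconstruction under stated assumptions, not Tangquan's exact program: a 100 MiB region filled with memset instead of the captured phone data, MADV_PAGEOUT requires Linux 5.4+, error handling trimmed):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (100UL << 20)		/* 100 MiB of anonymous memory */

static long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

int main(void)
{
	char *addr = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	long out_ms = 0, in_ms = 0;
	volatile char sink = 0;

	if (addr == MAP_FAILED)
		return 1;
	/* stand-in for reading captured anon data from a file */
	memset(addr, 0x11, SZ);

	for (int i = 0; i < 100; i++) {
		long t0 = now_ms();

		madvise(addr, SZ, MADV_PAGEOUT);	/* swap out */
		long t1 = now_ms();

		for (size_t off = 0; off < SZ; off += 4096)
			sink += addr[off];		/* fault back in */
		long t2 = now_ms();

		out_ms += t1 - t0;
		in_ms += t2 - t1;
	}
	printf("swapout %ld ms, swapin %ld ms\n", out_ms, in_ms);
	return 0;
}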
Hi Shakeel, We submitted an RFC patchset [1] with the Intel In-Memory Analytics Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n' compression algorithm in the IAA crypto driver. Relative to software compressors, we could get a 10X improvement in zram write latency and 7X improvement in zram read latency. [1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/ Thanks, Kanchana
On Thu, Aug 29, 2024 at 1:01 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Hi Shakeel, > > We submitted an RFC patchset [1] with the Intel In-Memory Analytics > Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n' > compression algorithm in the IAA crypto driver. > > Relative to software compressors, we could get a 10X improvement in zram > write latency and 7X improvement in zram read latency. > > [1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/ Hi Kanchana, Thanks for sharing. I understand you’ll need this mTHP swap-in series to leverage your IAA for parallel decompression, right? Without mTHP swap-in, you won't get this 7X improvement, right? This is another important use case for the mTHP swap-in series, highlighting the strong need to start the work from the sync IO device. I’ll try to find some time to review your patch and explore how we can better support both software and hardware improvements in zsmalloc/zram with a more compatible approach. Also, I have a talk[1] at LPC2024—would you mind if I include a description of your use case? [1] https://lpc.events/event/18/contributions/1780/ > > Thanks, > Kanchana Thanks Barry
Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 28, 2024 7:25 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: akpm@linux-foundation.org; baolin.wang@linux.alibaba.com; > chrisl@kernel.org; david@redhat.com; hanchuanhua@oppo.com; > hannes@cmpxchg.org; hch@infradead.org; hughd@google.com; > kaleshsingh@google.com; kasong@tencent.com; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; mhocko@suse.com; > minchan@kernel.org; nphamcs@gmail.com; ryan.roberts@arm.com; > ryncsn@gmail.com; senozhatsky@chromium.org; shakeel.butt@linux.dev; > shy828301@gmail.com; surenb@google.com; v-songbaohua@oppo.com; > willy@infradead.org; xiang@kernel.org; Huang, Ying > <ying.huang@intel.com>; yosryahmed@google.com; > zhengtangquan@oppo.com; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; > Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io > devices > > On Thu, Aug 29, 2024 at 1:01 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi Shakeel, > > > > We submitted an RFC patchset [1] with the Intel In-Memory Analytics > > Accelerator (Intel IAA) sometime back. This introduces a new 'canned-by_n' > > compression algorithm in the IAA crypto driver. > > > > Relative to software compressors, we could get a 10X improvement in zram > > write latency and 7X improvement in zram read latency. > > > > [1] > https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.co > m/ > > Hi Kanchana, > Thanks for sharing. I understand you’ll need this mTHP swap-in series > to leverage your > IAA for parallel decompression, right? Without mTHP swap-in, you won't > get this 7X > improvement, right? Yes, that is correct. > > This is another important use case for the mTHP swap-in series, > highlighting the strong > need to start the work from the sync IO device. Sure, this makes sense! > > I’ll try to find some time to review your patch and explore how we can > better support both > software and hardware improvements in zsmalloc/zram with a more > compatible approach. > Also, I have a talk[1] at LPC2024—would you mind if I include a > description of your use > case? Sure, this sounds good. Thanks, Kanchana > > [1] https://lpc.events/event/18/contributions/1780/ > > > > > Thanks, > > Kanchana > > Thanks > Barry
Hi All, On Wed, Aug 21, 2024 at 3:46 PM <hanchuanhua@oppo.com> wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> > > Currently, we have mTHP features, but unfortunately, without support for > large folio swap-ins, once these large folios are swapped out, they are > lost because mTHP swap is a one-way process. The lack of mTHP swap-in > functionality prevents mTHP from being used on devices like Android that > heavily rely on swap. > > This patch introduces mTHP swap-in support. It starts from sync devices > such as zRAM. This is probably the simplest and most common use case, > benefiting billions of Android phones and similar devices with minimal > implementation cost. In this straightforward scenario, large folios are > always exclusive, eliminating the need to handle complex rmap and > swapcache issues. > > It offers several benefits: > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after > swap-out and swap-in. Large folios in the buddy system are also > preserved as much as possible, rather than being fragmented due > to swap-in. > > 2. Eliminates fragmentation in swap slots and supports successful > THP_SWPOUT. > > w/o this patch (Refer to the data from Chris's and Kairui's latest > swap allocator optimization while running ./thp_swap_allocator_test > w/o "-a" option [1]): > > ./thp_swap_allocator_test > Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% > Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% > Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% > Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% > Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% > Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% > Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% > Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% > Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% > Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% > Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% > Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% > Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% > Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% > Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% > Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% > > w/ this patch (always 0%): > Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 9: swpout inc: 950, swpout fallback inc: 0, 
Fallback percentage: 0.00% > Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > ... > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > zsmalloc compression/decompression with larger granularity[2]. The upcoming > optimization in zsmalloc will significantly increase swap speed and improve > compression efficiency. Tested by running 100 iterations of swapping 100MiB > of anon memory, the swap speed improved dramatically: > time consumption of swapin(ms) time consumption of swapout(ms) > lz4 4k 45274 90540 > lz4 64k 22942 55667 > zstdn 4k 85035 186585 > zstdn 64k 46558 118533 > > The compression ratio also improved, as evaluated with 1 GiB of data: > granularity orig_data_size compr_data_size > 4KiB-zstd 1048576000 246876055 > 64KiB-zstd 1048576000 199763892 > > Without mTHP swap-in, the potential optimizations in zsmalloc cannot be > realized. > > 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor > of nr_pages. Swapping in content filled with the same data 0x11, w/o > and w/ the patch for five rounds (Since the content is the same, > decompression will be very fast. This primarily assesses the impact of > reduced page faults): > > swp in bandwidth(bytes/ms) w/o w/ > round1 624152 1127501 > round2 631672 1127501 > round3 620459 1139756 > round4 606113 1139756 > round5 624152 1152281 > avg 621310 1137359 +83% > > [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> > Co-developed-by: Barry Song <v-songbaohua@oppo.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 223 insertions(+), 27 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b9fe2f354878..7aa0358a4160 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > +static struct folio *__alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > + struct folio *folio; > + swp_entry_t entry; > + > + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, > + vmf->address, false); > + if (!folio) > + return NULL; > + > + entry = pte_to_swp_entry(vmf->orig_pte); > + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > + GFP_KERNEL, entry)) { > + folio_put(folio); > + return NULL; > + } > + > + return folio; > +} > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +/* > + * Check if the PTEs within a range are contiguous swap entries > + * and have no cache when check_no_cache is true. 
> + */ > +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, > + int nr_pages, bool check_no_cache) > +{ > + struct swap_info_struct *si; > + unsigned long addr; > + swp_entry_t entry; > + pgoff_t offset; > + int idx, i; > + pte_t pte; > + > + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); > + idx = (vmf->address - addr) / PAGE_SIZE; > + pte = ptep_get(ptep); > + > + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) > + return false; > + entry = pte_to_swp_entry(pte); > + offset = swp_offset(entry); > + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) > + return false; > + > + if (!check_no_cache) > + return true; > + > + si = swp_swap_info(entry); > + /* > + * While allocating a large folio and doing swap_read_folio, which is > + * the case the being faulted pte doesn't have swapcache. We need to > + * ensure all PTEs have no cache as well, otherwise, we might go to > + * swap devices while the content is in swapcache. > + */ > + for (i = 0; i < nr_pages; i++) { > + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) > + return false; > + } > + > + return true; > +} > + > +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > + unsigned long addr, > + unsigned long orders) > +{ > + int order, nr; > + > + order = highest_order(orders); > + > + /* > + * To swap in a THP with nr pages, we require that its first swap_offset > + * is aligned with that number, as it was when the THP was swapped out. > + * This helps filter out most invalid entries. > + */ > + while (orders) { > + nr = 1 << order; > + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) > + break; > + order = next_order(&orders, order); > + } > + > + return orders; > +} > + > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > + unsigned long orders; > + struct folio *folio; > + unsigned long addr; > + swp_entry_t entry; > + spinlock_t *ptl; > + pte_t *pte; > + gfp_t gfp; > + int order; > + > + /* > + * If uffd is active for the vma we need per-page fault fidelity to > + * maintain the uffd semantics. > + */ > + if (unlikely(userfaultfd_armed(vma))) > + goto fallback; > + > + /* > + * A large swapped out folio could be partially or fully in zswap. We > + * lack handling for such cases, so fallback to swapping in order-0 > + * folio. > + */ > + if (!zswap_never_enabled()) > + goto fallback; > + > + entry = pte_to_swp_entry(vmf->orig_pte); > + /* > + * Get a list of all the (large) orders below PMD_ORDER that are enabled > + * and suitable for swapping THP. > + */ > + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > + orders = thp_swap_suitable_orders(swp_offset(entry), > + vmf->address, orders); > + > + if (!orders) > + goto fallback; > + > + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, > + vmf->address & PMD_MASK, &ptl); > + if (unlikely(!pte)) > + goto fallback; > + > + /* > + * For do_swap_page, find the highest order where the aligned range is > + * completely swap entries with contiguous swap offsets. > + */ > + order = highest_order(orders); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true)) > + break; > + order = next_order(&orders, order); > + } > + > + pte_unmap_unlock(pte, ptl); > + > + /* Try allocating the highest of the remaining orders. 
*/ > + gfp = vma_thp_gfp_mask(vma); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) { > + if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > + gfp, entry)) > + return folio; > + folio_put(folio); > + } > + order = next_order(&orders, order); > + } > + > +fallback: > + return __alloc_swap_folio(vmf); > +} > +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ > +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, > + int nr_pages, bool check_no_cache) > +{ > + return false; > +} > + > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + return __alloc_swap_folio(vmf); > +} > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > + > /* > * We enter with non-exclusive mmap_lock (to exclude vma changes, > * but allow concurrent faults), and pte mapped but not yet locked. > @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > - /* > - * Prevent parallel swapin from proceeding with > - * the cache flag. Otherwise, another thread may > - * finish swapin first, free the entry, and swapout > - * reusing the same entry. It's undetectable as > - * pte_same() returns true due to entry reuse. > - */ > - if (swapcache_prepare(entry, 1)) { > - /* Relax a bit to prevent rapid repeated page faults */ > - schedule_timeout_uninterruptible(1); > - goto out; > - } > - need_clear_cache = true; > - > /* skip swapcache */ > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > - vma, vmf->address, false); > + folio = alloc_swap_folio(vmf); > if (folio) { > __folio_set_locked(folio); > __folio_set_swapbacked(folio); > > - if (mem_cgroup_swapin_charge_folio(folio, > - vma->vm_mm, GFP_KERNEL, > - entry)) { > - ret = VM_FAULT_OOM; > + nr_pages = folio_nr_pages(folio); > + if (folio_test_large(folio)) > + entry.val = ALIGN_DOWN(entry.val, nr_pages); > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread > + * may finish swapin first, free the entry, and > + * swapout reusing the same entry. It's > + * undetectable as pte_same() returns true due > + * to entry reuse. > + */ > + if (swapcache_prepare(entry, nr_pages)) { > + /* > + * Relax a bit to prevent rapid > + * repeated page faults. 
> + */ > + schedule_timeout_uninterruptible(1); > goto out_page; > } > - mem_cgroup_swapin_uncharge_swap(entry, 1); > + need_clear_cache = true; > + > + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); > > shadow = get_shadow_from_swap_cache(entry); > if (shadow) > @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > goto out_nomap; > } > > + /* allocated large folios for SWP_SYNCHRONOUS_IO */ > + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { > + unsigned long nr = folio_nr_pages(folio); > + unsigned long folio_start = ALIGN_DOWN(vmf->address, > + nr * PAGE_SIZE); > + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; > + pte_t *folio_ptep = vmf->pte - idx; > + > + if (!can_swapin_thp(vmf, folio_ptep, nr, false)) > + goto out_nomap; > + > + page_idx = idx; > + address = folio_start; > + ptep = folio_ptep; > + goto check_folio; > + } > + > nr_pages = 1; > page_idx = 0; > address = vmf->address; > @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_add_lru_vma(folio, vma); > } else if (!folio_test_anon(folio)) { > /* > - * We currently only expect small !anon folios, which are either > - * fully exclusive or fully shared. If we ever get large folios > - * here, we have to be careful. > + * We currently only expect small !anon folios which are either > + * fully exclusive or fully shared, or new allocated large > + * folios which are fully exclusive. If we ever get large > + * folios within swapcache here, we have to be careful. > */ > - VM_WARN_ON_ONCE(folio_test_large(folio)); > + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); > VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); > folio_add_new_anon_rmap(folio, vma, address, rmap_flags); > } else { > @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > out: > /* Clear the swap cache pin for direct swapin after PTL unlock */ > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_put(swapcache); > } > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > -- > 2.43.0 > With latest mm-unstable, I'm seeing following WARN followed by user space segfaults (multiple mTHP enabled): [ 39.145686] ------------[ cut here ]------------ [ 39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535 swap_read_folio+0x4db/0x520 [ 39.146307] Modules linked in: [ 39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not tainted 6.11.0-rc6.orig+ #131 [ 39.146887] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014 [ 39.147206] RIP: 0010:swap_read_folio+0x4db/0x520 [ 39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e 82 48 [ 39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202 [ 39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff [ 39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00 [ 39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078 [ 39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000 [ 39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003 [ 39.150177] FS: 
00007f51a56c9540(0000) GS:ffff888fffc00000(0000) knlGS:0000000000000000 [ 39.150623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0 [ 39.151317] Call Trace: [ 39.151565] <TASK> [ 39.151778] ? __warn+0x84/0x130 [ 39.152044] ? swap_read_folio+0x4db/0x520 [ 39.152345] ? report_bug+0xfc/0x1e0 [ 39.152614] ? handle_bug+0x3f/0x70 [ 39.152891] ? exc_invalid_op+0x17/0x70 [ 39.153178] ? asm_exc_invalid_op+0x1a/0x20 [ 39.153467] ? swap_read_folio+0x4db/0x520 [ 39.153753] do_swap_page+0xc6d/0x14f0 [ 39.154054] ? srso_return_thunk+0x5/0x5f [ 39.154361] __handle_mm_fault+0x758/0x850 [ 39.154645] handle_mm_fault+0x134/0x340 [ 39.154945] do_user_addr_fault+0x2e5/0x760 [ 39.155245] exc_page_fault+0x6a/0x140 [ 39.155546] asm_exc_page_fault+0x26/0x30 [ 39.155847] RIP: 0033:0x55627b071446 [ 39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25 72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da 03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45 85 c9 [ 39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246 [ 39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b [ 39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 [ 39.158998] </TASK> [ 39.159226] ---[ end trace 0000000000000000 ]--- After reverting this or Usama's "mm: store zero pages to be swapped out in a bitmap", the problem is gone. I think these two patches may have some conflict that needs to be resolved.
[..] > > With latest mm-unstable, I'm seeing following WARN followed by user > space segfaults (multiple mTHP enabled): > > [ 39.145686] ------------[ cut here ]------------ > [ 39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535 > swap_read_folio+0x4db/0x520 > [ 39.146307] Modules linked in: > [ 39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not > tainted 6.11.0-rc6.orig+ #131 > [ 39.146887] Hardware name: Tencent Cloud CVM, BIOS > seabios-1.9.1-qemu-project.org 04/01/2014 > [ 39.147206] RIP: 0010:swap_read_folio+0x4db/0x520 > [ 39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff > ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db > fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e > 82 48 > [ 39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202 > [ 39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff > [ 39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00 > [ 39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078 > [ 39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000 > [ 39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003 > [ 39.150177] FS: 00007f51a56c9540(0000) GS:ffff888fffc00000(0000) > knlGS:0000000000000000 > [ 39.150623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0 > [ 39.151317] Call Trace: > [ 39.151565] <TASK> > [ 39.151778] ? __warn+0x84/0x130 > [ 39.152044] ? swap_read_folio+0x4db/0x520 > [ 39.152345] ? report_bug+0xfc/0x1e0 > [ 39.152614] ? handle_bug+0x3f/0x70 > [ 39.152891] ? exc_invalid_op+0x17/0x70 > [ 39.153178] ? asm_exc_invalid_op+0x1a/0x20 > [ 39.153467] ? swap_read_folio+0x4db/0x520 > [ 39.153753] do_swap_page+0xc6d/0x14f0 > [ 39.154054] ? srso_return_thunk+0x5/0x5f > [ 39.154361] __handle_mm_fault+0x758/0x850 > [ 39.154645] handle_mm_fault+0x134/0x340 > [ 39.154945] do_user_addr_fault+0x2e5/0x760 > [ 39.155245] exc_page_fault+0x6a/0x140 > [ 39.155546] asm_exc_page_fault+0x26/0x30 > [ 39.155847] RIP: 0033:0x55627b071446 > [ 39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25 > 72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da > 03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45 > 85 c9 > [ 39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246 > [ 39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b > [ 39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff > [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > [ 39.158998] </TASK> > [ 39.159226] ---[ end trace 0000000000000000 ]--- > > After reverting this or Usama's "mm: store zero pages to be swapped > out in a bitmap", the problem is gone. I think these two patches may > have some conflict that needs to be resolved. Yup. I saw this conflict coming and specifically asked for this warning to be added in Usama's patch to catch it [1]. It served its purpose. Usama's patch does not handle large folio swapin, because at the time it was written we didn't have it. We expected Usama's series to land sooner than this one, so the warning was to make sure that this series handles large folio swapin in the zeromap code. 
Now that they are both in mm-unstable, we are gonna have to figure this out. I suspect Usama's patches are closer to land so it's better to handle this in this series, but I will leave it up to Usama and Chuanhua/Barry to figure this out :) [1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/
On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > > [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > > [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > > [ 39.158998] </TASK> > > [ 39.159226] ---[ end trace 0000000000000000 ]--- > > > > After reverting this or Usama's "mm: store zero pages to be swapped > > out in a bitmap", the problem is gone. I think these two patches may > > have some conflict that needs to be resolved. > > Yup. I saw this conflict coming and specifically asked for this > warning to be added in Usama's patch to catch it [1]. It served its > purpose. > > Usama's patch does not handle large folio swapin, because at the time > it was written we didn't have it. We expected Usama's series to land > sooner than this one, so the warning was to make sure that this series > handles large folio swapin in the zeromap code. Now that they are both > in mm-unstable, we are gonna have to figure this out. > > I suspect Usama's patches are closer to land so it's better to handle > this in this series, but I will leave it up to Usama and > Chuanhua/Barry to figure this out :) > > [1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/ Thanks. To unbreak -next I'll drop the two-patch series "mm: Ignite large folios swap-in support" for now. btw, next time can we please call it "enable large folios swap-in support"? "ignite" doesn't make much sense here.
On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > > > [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > > > [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > > > [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > > > [ 39.158998] </TASK> > > > [ 39.159226] ---[ end trace 0000000000000000 ]--- > > > > > > After reverting this or Usama's "mm: store zero pages to be swapped > > > out in a bitmap", the problem is gone. I think these two patches may > > > have some conflict that needs to be resolved. > > > > Yup. I saw this conflict coming and specifically asked for this > > warning to be added in Usama's patch to catch it [1]. It served its > > purpose. > > > > Usama's patch does not handle large folio swapin, because at the time > > it was written we didn't have it. We expected Usama's series to land > > sooner than this one, so the warning was to make sure that this series > > handles large folio swapin in the zeromap code. Now that they are both > > in mm-unstable, we are gonna have to figure this out. > > > > I suspect Usama's patches are closer to land so it's better to handle > > this in this series, but I will leave it up to Usama and > > Chuanhua/Barry to figure this out :) I believe handling this in swap-in might violate layer separation. `swap_read_folio()` should be a reliable API to call, regardless of whether `zeromap` is present. Therefore, the fix should likely be within `zeromap` but not this `swap-in`. I’ll take a look at this with Usama :-) > > > > [1]https://lore.kernel.org/lkml/CAJD7tkbpXjg00CRSrXU_pbaHwEaW1b3k8AQgu8y2PAh7EkTOug@mail.gmail.com/ > > Thanks. To unbreak -next I'll drop the two-patch series "mm: Ignite > large folios swap-in support" for now. > > btw, next time can we please call it "enable large folios swap-in > support"? "ignite" doesn't make much sense here. sure. > Thanks Barry
On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > > > > [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > > > > [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > > > > [ 39.158998] </TASK> > > > > [ 39.159226] ---[ end trace 0000000000000000 ]--- > > > > > > > > After reverting this or Usama's "mm: store zero pages to be swapped > > > > out in a bitmap", the problem is gone. I think these two patches may > > > > have some conflict that needs to be resolved. > > > > > > Yup. I saw this conflict coming and specifically asked for this > > > warning to be added in Usama's patch to catch it [1]. It served its > > > purpose. > > > > > > Usama's patch does not handle large folio swapin, because at the time > > > it was written we didn't have it. We expected Usama's series to land > > > sooner than this one, so the warning was to make sure that this series > > > handles large folio swapin in the zeromap code. Now that they are both > > > in mm-unstable, we are gonna have to figure this out. > > > > > > I suspect Usama's patches are closer to land so it's better to handle > > > this in this series, but I will leave it up to Usama and > > > Chuanhua/Barry to figure this out :) > > I believe handling this in swap-in might violate layer separation. > `swap_read_folio()` should be a reliable API to call, regardless of > whether `zeromap` is present. Therefore, the fix should likely be > within `zeromap` but not this `swap-in`. I’ll take a look at this with > Usama :-) I meant handling it within this series to avoid blocking Usama patches, not within this code. Thanks for taking a look, I am sure you and Usama will figure out the best way forward :)
On 03/09/2024 23:05, Yosry Ahmed wrote: > On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote: >> >> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: >>> >>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: >>> >>>>> [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 >>>>> [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 >>>>> [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 >>>>> [ 39.158998] </TASK> >>>>> [ 39.159226] ---[ end trace 0000000000000000 ]--- >>>>> >>>>> After reverting this or Usama's "mm: store zero pages to be swapped >>>>> out in a bitmap", the problem is gone. I think these two patches may >>>>> have some conflict that needs to be resolved. >>>> >>>> Yup. I saw this conflict coming and specifically asked for this >>>> warning to be added in Usama's patch to catch it [1]. It served its >>>> purpose. >>>> >>>> Usama's patch does not handle large folio swapin, because at the time >>>> it was written we didn't have it. We expected Usama's series to land >>>> sooner than this one, so the warning was to make sure that this series >>>> handles large folio swapin in the zeromap code. Now that they are both >>>> in mm-unstable, we are gonna have to figure this out. >>>> >>>> I suspect Usama's patches are closer to land so it's better to handle >>>> this in this series, but I will leave it up to Usama and >>>> Chuanhua/Barry to figure this out :) >> >> I believe handling this in swap-in might violate layer separation. >> `swap_read_folio()` should be a reliable API to call, regardless of >> whether `zeromap` is present. Therefore, the fix should likely be >> within `zeromap` but not this `swap-in`. I’ll take a look at this with >> Usama :-) > > I meant handling it within this series to avoid blocking Usama > patches, not within this code. Thanks for taking a look, I am sure you > and Usama will figure out the best way forward :) Hi Barry and Yosry, Is the best (and quickest) way forward to have a v8 of this with https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/ as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio in this support large folios swap-in patch? Thanks, Usama
On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 03/09/2024 23:05, Yosry Ahmed wrote: > > On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: > >>> > >>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > >>> > >>>>> [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > >>>>> [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > >>>>> [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > >>>>> [ 39.158998] </TASK> > >>>>> [ 39.159226] ---[ end trace 0000000000000000 ]--- > >>>>> > >>>>> After reverting this or Usama's "mm: store zero pages to be swapped > >>>>> out in a bitmap", the problem is gone. I think these two patches may > >>>>> have some conflict that needs to be resolved. > >>>> > >>>> Yup. I saw this conflict coming and specifically asked for this > >>>> warning to be added in Usama's patch to catch it [1]. It served its > >>>> purpose. > >>>> > >>>> Usama's patch does not handle large folio swapin, because at the time > >>>> it was written we didn't have it. We expected Usama's series to land > >>>> sooner than this one, so the warning was to make sure that this series > >>>> handles large folio swapin in the zeromap code. Now that they are both > >>>> in mm-unstable, we are gonna have to figure this out. > >>>> > >>>> I suspect Usama's patches are closer to land so it's better to handle > >>>> this in this series, but I will leave it up to Usama and > >>>> Chuanhua/Barry to figure this out :) > >> > >> I believe handling this in swap-in might violate layer separation. > >> `swap_read_folio()` should be a reliable API to call, regardless of > >> whether `zeromap` is present. Therefore, the fix should likely be > >> within `zeromap` but not this `swap-in`. I’ll take a look at this with > >> Usama :-) > > > > I meant handling it within this series to avoid blocking Usama > > patches, not within this code. Thanks for taking a look, I am sure you > > and Usama will figure out the best way forward :) > > Hi Barry and Yosry, > > Is the best (and quickest) way forward to have a v8 of this with > https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/ > as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio > in this support large folios swap-in patch? Yes, Usama. i can actually do a check: zeromap_cnt = swap_zeromap_entries_count(entry, nr); /* swap_read_folio() can handle inconsistent zeromap in multiple entries */ if (zeromap_cnt > 0 && zeromap_cnt < nr) try next order; On the other hand, if you read the code of zRAM, you will find zRAM has exactly the same mechanism as zeromap but zRAM can even do more by same_pages filled. since zRAM does the job in swapfile layer, there is no this kind of consistency issue like zeromap. So I feel for zRAM case, we don't need zeromap at all as there are duplicated efforts while I really appreciate your job which can benefit all swapfiles. i mean, zRAM has the ability to check "zero"(and also non-zero but same content). after zeromap checks zeromap, zRAM will check again: static int zram_write_page(struct zram *zram, struct page *page, u32 index) { ... if (page_same_filled(mem, &element)) { kunmap_local(mem); /* Free memory associated with this sector now. */ flags = ZRAM_SAME; atomic64_inc(&zram->stats.same_pages); goto out; } ... 
} So it seems that zeromap might slightly impact my zRAM use case. I'm not blaming you, just pointing out that there might be some overlap in effort here :-) > > Thanks, > Usama Thanks Barry
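As a concrete (hedged) sketch, the check Barry outlines could slot into alloc_swap_folio()'s order-selection loop roughly as below; it assumes the swap_zeromap_entries_count() helper proposed in the linked zeromap fix, taking a starting swap entry and a count, and is not the code that was eventually merged:

	/*
	 * Skip any order whose aligned swap range mixes zeromap and
	 * non-zeromap entries: swap_read_folio() can serve a folio that is
	 * entirely zero-filled or entirely backed, but not a mixture.
	 */
	order = highest_order(orders);
	while (orders) {
		int nr = 1 << order;

		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		if (can_swapin_thp(vmf, pte + pte_index(addr), nr, true)) {
			swp_entry_t first = swp_entry(swp_type(entry),
					ALIGN_DOWN(swp_offset(entry), nr));
			unsigned int zero_cnt =
				swap_zeromap_entries_count(first, nr);

			/* all-zero or all-backed is fine; a mix is not */
			if (zero_cnt == 0 || zero_cnt == nr)
				break;
		}
		order = next_order(&orders, order);
	}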
On 05/09/2024 00:10, Barry Song wrote: > On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 03/09/2024 23:05, Yosry Ahmed wrote: >>> On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote: >>>> >>>> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: >>>>> >>>>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: >>>>> >>>>>>> [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 >>>>>>> [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 >>>>>>> [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 >>>>>>> [ 39.158998] </TASK> >>>>>>> [ 39.159226] ---[ end trace 0000000000000000 ]--- >>>>>>> >>>>>>> After reverting this or Usama's "mm: store zero pages to be swapped >>>>>>> out in a bitmap", the problem is gone. I think these two patches may >>>>>>> have some conflict that needs to be resolved. >>>>>> >>>>>> Yup. I saw this conflict coming and specifically asked for this >>>>>> warning to be added in Usama's patch to catch it [1]. It served its >>>>>> purpose. >>>>>> >>>>>> Usama's patch does not handle large folio swapin, because at the time >>>>>> it was written we didn't have it. We expected Usama's series to land >>>>>> sooner than this one, so the warning was to make sure that this series >>>>>> handles large folio swapin in the zeromap code. Now that they are both >>>>>> in mm-unstable, we are gonna have to figure this out. >>>>>> >>>>>> I suspect Usama's patches are closer to land so it's better to handle >>>>>> this in this series, but I will leave it up to Usama and >>>>>> Chuanhua/Barry to figure this out :) >>>> >>>> I believe handling this in swap-in might violate layer separation. >>>> `swap_read_folio()` should be a reliable API to call, regardless of >>>> whether `zeromap` is present. Therefore, the fix should likely be >>>> within `zeromap` but not this `swap-in`. I’ll take a look at this with >>>> Usama :-) >>> >>> I meant handling it within this series to avoid blocking Usama >>> patches, not within this code. Thanks for taking a look, I am sure you >>> and Usama will figure out the best way forward :) >> >> Hi Barry and Yosry, >> >> Is the best (and quickest) way forward to have a v8 of this with >> https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/ >> as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio >> in this support large folios swap-in patch? > > Yes, Usama. i can actually do a check: > > zeromap_cnt = swap_zeromap_entries_count(entry, nr); > > /* swap_read_folio() can handle inconsistent zeromap in multiple entries */ > if (zeromap_cnt > 0 && zeromap_cnt < nr) > try next order; > > On the other hand, if you read the code of zRAM, you will find zRAM has > exactly the same mechanism as zeromap but zRAM can even do more > by same_pages filled. since zRAM does the job in swapfile layer, there > is no this kind of consistency issue like zeromap. > > So I feel for zRAM case, we don't need zeromap at all as there are duplicated > efforts while I really appreciate your job which can benefit all swapfiles. > i mean, zRAM has the ability to check "zero"(and also non-zero but same > content). after zeromap checks zeromap, zRAM will check again: > Yes, so there is a reason for having the zeromap patches, which I have outlined in the coverletter. 
https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ There are use cases where zswap/zram might not be used in production. We can reduce I/O and flash wear in those cases by a large amount. Also, running in Meta production, we found that the number of pages completely filled with a non-zero value was less than 1%, so essentially it's only the zero-filled pages that matter. I believe that after zeromap, it might be a good idea to remove the page_same_filled check from the zram code? It's not really a problem if it's kept, as I don't believe any zero-filled pages should reach zram_write_page? > static int zram_write_page(struct zram *zram, struct page *page, u32 index) > { > ... > > if (page_same_filled(mem, &element)) { > kunmap_local(mem); > /* Free memory associated with this sector now. */ > flags = ZRAM_SAME; > atomic64_inc(&zram->stats.same_pages); > goto out; > } > ... > } > > So it seems that zeromap might slightly impact my zRAM use case. I'm not > blaming you, just pointing out that there might be some overlap in effort > here :-) > >> >> Thanks, >> Usama > > Thanks > Barry
On Thu, Sep 5, 2024 at 11:23 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 05/09/2024 00:10, Barry Song wrote: > > On Thu, Sep 5, 2024 at 9:30 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 03/09/2024 23:05, Yosry Ahmed wrote: > >>> On Tue, Sep 3, 2024 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote: > >>>> > >>>> On Wed, Sep 4, 2024 at 8:08 AM Andrew Morton <akpm@linux-foundation.org> wrote: > >>>>> > >>>>> On Tue, 3 Sep 2024 11:38:37 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > >>>>> > >>>>>>> [ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007 > >>>>>>> [ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 > >>>>>>> [ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518 > >>>>>>> [ 39.158998] </TASK> > >>>>>>> [ 39.159226] ---[ end trace 0000000000000000 ]--- > >>>>>>> > >>>>>>> After reverting this or Usama's "mm: store zero pages to be swapped > >>>>>>> out in a bitmap", the problem is gone. I think these two patches may > >>>>>>> have some conflict that needs to be resolved. > >>>>>> > >>>>>> Yup. I saw this conflict coming and specifically asked for this > >>>>>> warning to be added in Usama's patch to catch it [1]. It served its > >>>>>> purpose. > >>>>>> > >>>>>> Usama's patch does not handle large folio swapin, because at the time > >>>>>> it was written we didn't have it. We expected Usama's series to land > >>>>>> sooner than this one, so the warning was to make sure that this series > >>>>>> handles large folio swapin in the zeromap code. Now that they are both > >>>>>> in mm-unstable, we are gonna have to figure this out. > >>>>>> > >>>>>> I suspect Usama's patches are closer to land so it's better to handle > >>>>>> this in this series, but I will leave it up to Usama and > >>>>>> Chuanhua/Barry to figure this out :) > >>>> > >>>> I believe handling this in swap-in might violate layer separation. > >>>> `swap_read_folio()` should be a reliable API to call, regardless of > >>>> whether `zeromap` is present. Therefore, the fix should likely be > >>>> within `zeromap` but not this `swap-in`. I’ll take a look at this with > >>>> Usama :-) > >>> > >>> I meant handling it within this series to avoid blocking Usama > >>> patches, not within this code. Thanks for taking a look, I am sure you > >>> and Usama will figure out the best way forward :) > >> > >> Hi Barry and Yosry, > >> > >> Is the best (and quickest) way forward to have a v8 of this with > >> https://lore.kernel.org/all/20240904055522.2376-1-21cnbao@gmail.com/ > >> as the first patch, and using swap_zeromap_entries_count in alloc_swap_folio > >> in this support large folios swap-in patch? > > > > Yes, Usama. i can actually do a check: > > > > zeromap_cnt = swap_zeromap_entries_count(entry, nr); > > > > /* swap_read_folio() can handle inconsistent zeromap in multiple entries */ > > if (zeromap_cnt > 0 && zeromap_cnt < nr) > > try next order; > > > > On the other hand, if you read the code of zRAM, you will find zRAM has > > exactly the same mechanism as zeromap but zRAM can even do more > > by same_pages filled. since zRAM does the job in swapfile layer, there > > is no this kind of consistency issue like zeromap. > > > > So I feel for zRAM case, we don't need zeromap at all as there are duplicated > > efforts while I really appreciate your job which can benefit all swapfiles. > > i mean, zRAM has the ability to check "zero"(and also non-zero but same > > content). 
after zeromap checks zeromap, zRAM will check again: > > > > Yes, so there is a reason for having the zeromap patches, which I have outlined > in the coverletter. > > https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ > > There are usecases where zswap/zram might not be used in production. > We can reduce I/O and flash wear in those cases by a large amount. > > Also running in Meta production, we found that the number of non-zero filled > complete pages were less than 1%, so essentially its only the zero-filled pages > that matter. I don't have data on Android phones, i'd like to see if phones have exactly the same ratio that non-zero same page is rare. > > I believe after zeromap, it might be a good idea to remove the page_same_filled > check from zram code? Its not really a problem if its kept as well as I dont > believe any zero-filled pages should reach zram_write_page? > > > static int zram_write_page(struct zram *zram, struct page *page, u32 index) > > { > > ... > > > > if (page_same_filled(mem, &element)) { > > kunmap_local(mem); > > /* Free memory associated with this sector now. */ > > flags = ZRAM_SAME; > > atomic64_inc(&zram->stats.same_pages); > > goto out; > > } > > ... > > } > > > > So it seems that zeromap might slightly impact my zRAM use case. I'm not > > blaming you, just pointing out that there might be some overlap in effort > > here :-) > > > >> > >> Thanks, > >> Usama > > Thanks Barry
[..] > > > > On the other hand, if you read the code of zRAM, you will find zRAM has > > exactly the same mechanism as zeromap but zRAM can even do more > > by same_pages filled. since zRAM does the job in swapfile layer, there > > is no this kind of consistency issue like zeromap. > > > > So I feel for zRAM case, we don't need zeromap at all as there are duplicated > > efforts while I really appreciate your job which can benefit all swapfiles. > > i mean, zRAM has the ability to check "zero"(and also non-zero but same > > content). after zeromap checks zeromap, zRAM will check again: > > > > Yes, so there is a reason for having the zeromap patches, which I have outlined > in the coverletter. > > https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ > > There are usecases where zswap/zram might not be used in production. > We can reduce I/O and flash wear in those cases by a large amount. > > Also running in Meta production, we found that the number of non-zero filled > complete pages were less than 1%, so essentially its only the zero-filled pages > that matter. > > I believe after zeromap, it might be a good idea to remove the page_same_filled > check from zram code? Its not really a problem if its kept as well as I dont > believe any zero-filled pages should reach zram_write_page? I brought this up before and Sergey pointed out that zram is sometimes used as a block device without swap, and that use case would benefit from having this handling in zram. That being said, I have no idea how many people care about this specific scenario.
On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > > > > On the other hand, if you read the code of zRAM, you will find zRAM has > > > exactly the same mechanism as zeromap but zRAM can even do more > > > by same_pages filled. since zRAM does the job in swapfile layer, there > > > is no this kind of consistency issue like zeromap. > > > > > > So I feel for zRAM case, we don't need zeromap at all as there are duplicated > > > efforts while I really appreciate your job which can benefit all swapfiles. > > > i mean, zRAM has the ability to check "zero"(and also non-zero but same > > > content). after zeromap checks zeromap, zRAM will check again: > > > > > > > Yes, so there is a reason for having the zeromap patches, which I have outlined > > in the coverletter. > > > > https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ > > > > There are usecases where zswap/zram might not be used in production. > > We can reduce I/O and flash wear in those cases by a large amount. > > > > Also running in Meta production, we found that the number of non-zero filled > > complete pages were less than 1%, so essentially its only the zero-filled pages > > that matter. > > > > I believe after zeromap, it might be a good idea to remove the page_same_filled > > check from zram code? Its not really a problem if its kept as well as I dont > > believe any zero-filled pages should reach zram_write_page? > > I brought this up before and Sergey pointed out that zram is sometimes > used as a block device without swap, and that use case would benefit > from having this handling in zram. That being said, I have no idea how > many people care about this specific scenario. Hi Usama/Yosry, We successfully gathered page_same_filled data for zram on Android. Interestingly, our findings differ from yours on zswap. Hailong discovered that around 85-86% of the page_same_filled data consists of zeros, while about 15% are non-zero. We suspect that on Android or similar systems, some graphics or media data might be duplicated at times, such as a red block displayed on the screen. Does this suggest that page_same_filled could still provide some benefits in zram cases? Thanks Barry
On 23/09/2024 00:57, Barry Song wrote: > On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> [..] >>>> >>>> On the other hand, if you read the code of zRAM, you will find zRAM has >>>> exactly the same mechanism as zeromap but zRAM can even do more >>>> by same_pages filled. since zRAM does the job in swapfile layer, there >>>> is no this kind of consistency issue like zeromap. >>>> >>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated >>>> efforts while I really appreciate your job which can benefit all swapfiles. >>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same >>>> content). after zeromap checks zeromap, zRAM will check again: >>>> >>> >>> Yes, so there is a reason for having the zeromap patches, which I have outlined >>> in the coverletter. >>> >>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ >>> >>> There are usecases where zswap/zram might not be used in production. >>> We can reduce I/O and flash wear in those cases by a large amount. >>> >>> Also running in Meta production, we found that the number of non-zero filled >>> complete pages were less than 1%, so essentially its only the zero-filled pages >>> that matter. >>> >>> I believe after zeromap, it might be a good idea to remove the page_same_filled >>> check from zram code? Its not really a problem if its kept as well as I dont >>> believe any zero-filled pages should reach zram_write_page? >> >> I brought this up before and Sergey pointed out that zram is sometimes >> used as a block device without swap, and that use case would benefit >> from having this handling in zram. That being said, I have no idea how >> many people care about this specific scenario. > > Hi Usama/Yosry, > > We successfully gathered page_same_filled data for zram on Android. > Interestingly, > our findings differ from yours on zswap. > > Hailong discovered that around 85-86% of the page_same_filled data > consists of zeros, > while about 15% are non-zero. We suspect that on Android or similar > systems, some > graphics or media data might be duplicated at times, such as a red > block displayed > on the screen. > > Does this suggest that page_same_filled could still provide some > benefits in zram > cases? Hi Barry, Thanks for the data, it's very interesting to see this from the mobile side. Even though it's not the 99% that I observed, I do feel 85% is still quite high. The two main reasons for the patchset to store zero pages to be swapped out in a bitmap were for applications that use swap but not zswap/zram (reducing I/O and flash wear), and simplifying zswap code. It also meant fewer zswap_entry structs in memory, which would offset the memory usage of the bitmap. Yosry mentioned that Sergey pointed out that zram is sometimes used as a block device without swap, and that use case would benefit from having this handling in zram. Will that case not also go through swap_writepage and therefore be taken care of by swap_info_struct->zeromap? Also, just curious: are there cases on mobile where only swap is used, but not zswap/zram? I think even with 85% zero-filled pages, we could get rid of page_same_filled, especially if the zram-without-swap case is handled by swap_info_struct->zeromap. But I don't have a strong preference. Thanks, Usama > > Thanks > Barry
On Mon, Sep 23, 2024 at 11:22:30AM +0100, Usama Arif wrote: > On 23/09/2024 00:57, Barry Song wrote: > > On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>>> On the other hand, if you read the code of zRAM, you will find zRAM has > >>>> exactly the same mechanism as zeromap but zRAM can even do more > >>>> by same_pages filled. since zRAM does the job in swapfile layer, there > >>>> is no this kind of consistency issue like zeromap. > >>>> > >>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated > >>>> efforts while I really appreciate your job which can benefit all swapfiles. > >>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same > >>>> content). after zeromap checks zeromap, zRAM will check again: > >>>> > >>> > >>> Yes, so there is a reason for having the zeromap patches, which I have outlined > >>> in the coverletter. > >>> > >>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ > >>> > >>> There are usecases where zswap/zram might not be used in production. > >>> We can reduce I/O and flash wear in those cases by a large amount. > >>> > >>> Also running in Meta production, we found that the number of non-zero filled > >>> complete pages were less than 1%, so essentially its only the zero-filled pages > >>> that matter. > >>> > >>> I believe after zeromap, it might be a good idea to remove the page_same_filled > >>> check from zram code? Its not really a problem if its kept as well as I dont > >>> believe any zero-filled pages should reach zram_write_page? > >> > >> I brought this up before and Sergey pointed out that zram is sometimes > >> used as a block device without swap, and that use case would benefit > >> from having this handling in zram. That being said, I have no idea how > >> many people care about this specific scenario. > > > > Hi Usama/Yosry, > > > > We successfully gathered page_same_filled data for zram on Android. > > Interestingly, > > our findings differ from yours on zswap. > > > > Hailong discovered that around 85-86% of the page_same_filled data > > consists of zeros, > > while about 15% are non-zero. We suspect that on Android or similar > > systems, some > > graphics or media data might be duplicated at times, such as a red > > block displayed > > on the screen. > > > > Does this suggest that page_same_filled could still provide some > > benefits in zram > > cases? > > Hi Barry, > > Thanks for the data, its very interesting to know this from mobile side. > Eventhough its not 99% that I observed, I do feel 85% is still quite high. Would it be possible to benchmark Android with zram only optimizing zero pages? diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index c3d245617083..f6ded491fd00 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -211,6 +211,9 @@ static bool page_same_filled(void *ptr, unsigned long *element) page = (unsigned long *)ptr; val = page[0]; + if (val) + return false; + if (val != page[last_pos]) return false; My take is that, if this is worth optimizing for, then it's probably worth optimizing for in the generic swap layer too. It makes sense to maintain feature parity if we one day want Android to work with zswap. > The 2 main reasons for the patcheset to store zero pages to be > swapped out in a bitmap were for applications that use swap but > not zswap/zram (reducing I/O and flash wear), and simplifying zswap > code. 
It also meant fewer zswap_entry structs in memory which would > offset the memory usage by bitmap. > > Yosry mentioned that Sergey pointed out that zram is sometimes > used as a block device without swap, and that use case would benefit > from having this handling in zram. Will that case also not go through > swap_writepage and therefore be takencare of by swap_info_struct->zeromap? No, it won't. However, the write bios sent from swap have REQ_SWAP set. Zram could make its same-filled optimization conditional on that flag if it wants to maintain it for the raw block device use case.
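A rough sketch of the condition Johannes suggests, assuming the swap-vs-raw-block-device origin were plumbed down to the same-filled path; upstream zram_write_page() does not see the bio today, so the helper below is hypothetical and only shows the intended check.

/*
 * Hypothetical helper, not upstream code: keep zram's same-filled
 * optimization only for raw block-device I/O. Swap writes carry
 * REQ_SWAP in bio->bi_opf, and their zero-filled pages are already
 * short-circuited by swap_info_struct->zeromap before reaching zram.
 */
static bool zram_want_same_filled_check(struct bio *bio)
{
	return !(bio->bi_opf & REQ_SWAP);
}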
On Mon, Sep 23, 2024 at 5:10 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Sep 23, 2024 at 11:22:30AM +0100, Usama Arif wrote: > > On 23/09/2024 00:57, Barry Song wrote: > > > On Thu, Sep 5, 2024 at 7:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > >>>> On the other hand, if you read the code of zRAM, you will find zRAM has > > >>>> exactly the same mechanism as zeromap but zRAM can even do more > > >>>> by same_pages filled. since zRAM does the job in swapfile layer, there > > >>>> is no this kind of consistency issue like zeromap. > > >>>> > > >>>> So I feel for zRAM case, we don't need zeromap at all as there are duplicated > > >>>> efforts while I really appreciate your job which can benefit all swapfiles. > > >>>> i mean, zRAM has the ability to check "zero"(and also non-zero but same > > >>>> content). after zeromap checks zeromap, zRAM will check again: > > >>>> > > >>> > > >>> Yes, so there is a reason for having the zeromap patches, which I have outlined > > >>> in the coverletter. > > >>> > > >>> https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/ > > >>> > > >>> There are usecases where zswap/zram might not be used in production. > > >>> We can reduce I/O and flash wear in those cases by a large amount. > > >>> > > >>> Also running in Meta production, we found that the number of non-zero filled > > >>> complete pages were less than 1%, so essentially its only the zero-filled pages > > >>> that matter. > > >>> > > >>> I believe after zeromap, it might be a good idea to remove the page_same_filled > > >>> check from zram code? Its not really a problem if its kept as well as I dont > > >>> believe any zero-filled pages should reach zram_write_page? > > >> > > >> I brought this up before and Sergey pointed out that zram is sometimes > > >> used as a block device without swap, and that use case would benefit > > >> from having this handling in zram. That being said, I have no idea how > > >> many people care about this specific scenario. > > > > > > Hi Usama/Yosry, > > > > > > We successfully gathered page_same_filled data for zram on Android. > > > Interestingly, > > > our findings differ from yours on zswap. > > > > > > Hailong discovered that around 85-86% of the page_same_filled data > > > consists of zeros, > > > while about 15% are non-zero. We suspect that on Android or similar > > > systems, some > > > graphics or media data might be duplicated at times, such as a red > > > block displayed > > > on the screen. > > > > > > Does this suggest that page_same_filled could still provide some > > > benefits in zram > > > cases? > > > > Hi Barry, > > > > Thanks for the data, its very interesting to know this from mobile side. > > Eventhough its not 99% that I observed, I do feel 85% is still quite high. > > Would it be possible to benchmark Android with zram only optimizing > zero pages? > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c > index c3d245617083..f6ded491fd00 100644 > --- a/drivers/block/zram/zram_drv.c > +++ b/drivers/block/zram/zram_drv.c > @@ -211,6 +211,9 @@ static bool page_same_filled(void *ptr, unsigned long *element) > page = (unsigned long *)ptr; > val = page[0]; > > + if (val) > + return false; > + > if (val != page[last_pos]) > return false; > > My take is that, if this is worth optimizing for, then it's probably > worth optimizing for in the generic swap layer too. It makes sense to > maintain feature parity if we one day want Android to work with zswap. 
I am not sure if it's worth it for the generic swap layer. We would need to store 8 bytes per swap entry to maintain feature parity; that's about 0.2% of swap capacity as a constant memory overhead. Swap capacity is usually higher than the actual size of swapped data. IIUC the data you gathered from prod showed that 1% of same-filled pages were non-zero, and 10-20% of swapped data was same-filled [1]. That means that ~0.15% of swapped data is non-zero same-filled. With zswap, assuming a 3:1 compression ratio, we'd be paying 0.2% of swap capacity to save around 0.05% of swapped data in memory. The tradeoff may be even worse in practice, because the compression ratio for same-filled data is likely higher than 3:1, so the savings would be smaller still. With SSD swap, I am not sure if a 0.15% reduction in IO is worth the memory overhead. OTOH, zram keeps track of the same-filled value for free because it overlays the zsmalloc handle (like zswap used to do), so the same tradeoffs do not apply. Barry mentioned that 15% of same-filled pages are non-zero in their Android experiment, but what % of total swapped memory is this, and how much space would it take if we just compressed it instead? IOW, how much memory is this really saving with zram (especially since zram's metadata is statically allocated)?
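As a sanity check on the percentages above, a small standalone calculation (assuming 4 KiB pages, the 15% midpoint of the 10-20% same-filled figure, and the 3:1 compression ratio assumed above):

#include <stdio.h>

int main(void)
{
	const double page_size = 4096.0;	/* bytes; 4 KiB pages assumed */
	const double per_entry = 8.0;		/* bytes per swap slot to store the filled value */
	const double same_filled = 0.15;	/* midpoint of the 10-20% figure */
	const double nonzero_frac = 0.01;	/* ~1% of same-filled pages are non-zero */
	const double zswap_ratio = 3.0;		/* assumed zswap compression ratio */

	double overhead = per_entry / page_size;	/* cost, as a fraction of swap capacity */
	double nonzero = same_filled * nonzero_frac;	/* non-zero same-filled share of swapped data */
	double saved = nonzero / zswap_ratio;		/* memory zswap would otherwise spend on it */

	printf("metadata overhead: %.2f%% of swap capacity\n", overhead * 100);
	printf("non-zero same-filled: %.2f%% of swapped data\n", nonzero * 100);
	printf("memory saved vs. just compressing it: %.3f%% of swapped data\n", saved * 100);
	return 0;
}

This prints roughly 0.20%, 0.15%, and 0.050%, matching the figures quoted in the paragraph above.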
diff --git a/mm/memory.c b/mm/memory.c index b9fe2f354878..7aa0358a4160 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +static struct folio *__alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct folio *folio; + swp_entry_t entry; + + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, + vmf->address, false); + if (!folio) + return NULL; + + entry = pte_to_swp_entry(vmf->orig_pte); + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, + GFP_KERNEL, entry)) { + folio_put(folio); + return NULL; + } + + return folio; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * Check if the PTEs within a range are contiguous swap entries + * and have no cache when check_no_cache is true. + */ +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, + int nr_pages, bool check_no_cache) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + if (!check_no_cache) + return true; + + si = swp_swap_info(entry); + /* + * While allocating a large folio and doing swap_read_folio, which is + * the case the being faulted pte doesn't have swapcache. We need to + * ensure all PTEs have no cache as well, otherwise, we might go to + * swap devices while the content is in swapcache. + */ + for (i = 0; i < nr_pages; i++) { + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + return false; + } + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, + unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap in a THP with nr pages, we require that its first swap_offset + * is aligned with that number, as it was when the THP was swapped out. + * This helps filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. 
+ */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), + vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, + vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. */ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) { + if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, + gfp, entry)) + return folio; + folio_put(folio); + } + order = next_order(&orders, order); + } + +fallback: + return __alloc_swap_folio(vmf); +} +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, + int nr_pages, bool check_no_cache) +{ + return false; +} + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + return __alloc_swap_folio(vmf); +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry, 1)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); - if (mem_cgroup_swapin_charge_folio(folio, - vma->vm_mm, GFP_KERNEL, - entry)) { - ret = VM_FAULT_OOM; + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread + * may finish swapin first, free the entry, and + * swapout reusing the same entry. It's + * undetectable as pte_same() returns true due + * to entry reuse. + */ + if (swapcache_prepare(entry, nr_pages)) { + /* + * Relax a bit to prevent rapid + * repeated page faults. 
+ */ + schedule_timeout_uninterruptible(1); goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry, 1); + need_clear_cache = true; + + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, + nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr, false)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large + * folios which are fully exclusive. If we ever get large + * folios within swapcache here, we have to be careful. */ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret;
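To make the alignment rule enforced by thp_swap_suitable_orders() above concrete, here is a small standalone restatement of the filter (simplified; the example values are made up): an order is kept only when the faulting virtual page number and the swap offset are congruent modulo 1 << order, i.e. only when the folio could have been swapped out contiguously at that alignment.

#include <stdio.h>

/* Simplified restatement of the filter in thp_swap_suitable_orders(). */
static int order_suitable(unsigned long vpfn, unsigned long swp_off, int order)
{
	unsigned long nr = 1UL << order;

	return (vpfn % nr) == (swp_off % nr);
}

int main(void)
{
	unsigned long vpfn = 0x1234;	/* vmf->address >> PAGE_SHIFT (made up) */
	unsigned long swp_off = 0x5638;	/* swp_offset(entry) (made up) */
	int order;

	for (order = 4; order >= 0; order--)
		printf("order %d: %s\n", order,
		       order_suitable(vpfn, swp_off, order) ? "keep" : "skip");
	return 0;
}

With these values, orders 4 and 3 are skipped and order 2 is the highest order the fault will try, which mirrors the "try the highest remaining order, then fall back" loop in alloc_swap_folio().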