Message ID | 20250307120141.1566673-1-qun-wei.lin@mediatek.com (mailing list archive) |
---|---|
Series | Improve Zram by separating compression context from kswapd |
On Sat, Mar 8, 2025 at 1:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote: > > This patch series introduces a new mechanism called kcompressd to > improve the efficiency of memory reclaiming in the operating system. The > main goal is to separate the tasks of page scanning and page compression > into distinct processes or threads, thereby reducing the load on the > kswapd thread and enhancing overall system performance under high memory > pressure conditions. > > Problem: > In the current system, the kswapd thread is responsible for both > scanning the LRU pages and compressing pages into the ZRAM. This > combined responsibility can lead to significant performance bottlenecks, > especially under high memory pressure. The kswapd thread becomes a > single point of contention, causing delays in memory reclaiming and > overall system performance degradation. > > Target: > The target of this invention is to improve the efficiency of memory > reclaiming. By separating the tasks of page scanning and page > compression into distinct processes or threads, the system can handle > memory pressure more effectively. Sounds great. However, we also have a time window where folios under writeback are kept, whereas previously, writeback was done synchronously without your patch. This may temporarily increase memory usage until the kept folios are re-scanned. So, you’ve observed that folio_rotate_reclaimable() runs shortly while the async thread completes compression? Then the kept folios are shortly re-scanned? > > Patch 1: > - Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and > SWP_READ_SYNCHRONOUS_IO. > > Patch 2: > - Implemented the core functionality of Kcompressd and made necessary > modifications to the zram driver to support it. > > In our handheld devices, we found that applying this mechanism under high > memory pressure scenarios can increase the rate of pgsteal_anon per second > by over 260% compared to the situation with only kswapd. Sounds really great. 
What compression algorithm is being used? I assume that after switching to a different compression algorithm, the benefits will change significantly. For example, Zstd might not show as much improvement. What CPU usage ratio between page scan/unmap and compression was observed before applying this patch? > > Qun-Wei Lin (2): > mm: Split BLK_FEAT_SYNCHRONOUS and SWP_SYNCHRONOUS_IO into separate > read and write flags > kcompressd: Add Kcompressd for accelerated zram compression > > drivers/block/brd.c | 3 +- > drivers/block/zram/Kconfig | 11 ++ > drivers/block/zram/Makefile | 3 +- > drivers/block/zram/kcompressd.c | 340 ++++++++++++++++++++++++++++++++ > drivers/block/zram/kcompressd.h | 25 +++ > drivers/block/zram/zram_drv.c | 21 +- > drivers/nvdimm/btt.c | 3 +- > drivers/nvdimm/pmem.c | 5 +- > include/linux/blkdev.h | 24 ++- > include/linux/swap.h | 31 +-- > mm/memory.c | 4 +- > mm/page_io.c | 6 +- > mm/swapfile.c | 7 +- > 13 files changed, 446 insertions(+), 37 deletions(-) > create mode 100644 drivers/block/zram/kcompressd.c > create mode 100644 drivers/block/zram/kcompressd.h > > -- > 2.45.2 > Thanks Barry
On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote: > > This patch series introduces a new mechanism called kcompressd to > improve the efficiency of memory reclaiming in the operating system. The > main goal is to separate the tasks of page scanning and page compression > into distinct processes or threads, thereby reducing the load on the > kswapd thread and enhancing overall system performance under high memory > pressure conditions. Please excuse my ignorance, but from your cover letter I still don't quite get what the problem is here. And how would decoupling compression and scanning help? > > Problem: > In the current system, the kswapd thread is responsible for both > scanning the LRU pages and compressing pages into the ZRAM. This > combined responsibility can lead to significant performance bottlenecks, What bottleneck are we talking about? Is one stage slower than the other? > especially under high memory pressure. The kswapd thread becomes a > single point of contention, causing delays in memory reclaiming and > overall system performance degradation. > > Target: > The target of this invention is to improve the efficiency of memory > reclaiming. By separating the tasks of page scanning and page > compression into distinct processes or threads, the system can handle > memory pressure more effectively. I'm not a zram maintainer, so I'm definitely not trying to stop this patch. But whatever problem zram is facing will likely occur with zswap too, so I'd like to learn more :)
On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote: > > > > This patch series introduces a new mechanism called kcompressd to > > improve the efficiency of memory reclaiming in the operating system. The > > main goal is to separate the tasks of page scanning and page compression > > into distinct processes or threads, thereby reducing the load on the > > kswapd thread and enhancing overall system performance under high memory > > pressure conditions. > > Please excuse my ignorance, but from your cover letter I still don't > quite get what is the problem here? And how would decouple compression > and scanning help? My understanding is as follows: When kswapd attempts to reclaim M anonymous folios and N file folios, the process involves the following steps: * t1: Time to scan and unmap anonymous folios * t2: Time to compress anonymous folios * t3: Time to reclaim file folios Currently, these steps are executed sequentially, meaning the total time required to reclaim M + N folios is t1 + t2 + t3. However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel, reducing the total time to max(t1 + t3, t2). This likely improves the reclamation speed, potentially reducing allocation stalls. I don’t have concrete data on this. Does Qun-Wei have detailed performance data? > > > > > Problem: > > In the current system, the kswapd thread is responsible for both > > scanning the LRU pages and compressing pages into the ZRAM. This > > combined responsibility can lead to significant performance bottlenecks, > > What bottleneck are we talking about? Is one stage slower than the other? > > > especially under high memory pressure. The kswapd thread becomes a > > single point of contention, causing delays in memory reclaiming and > > overall system performance degradation. > > > > Target: > > The target of this invention is to improve the efficiency of memory > > reclaiming. 
By separating the tasks of page scanning and page > > compression into distinct processes or threads, the system can handle > > memory pressure more effectively. > > I'm not a zram maintainer, so I'm definitely not trying to stop this > patch. But whatever problem zram is facing will likely occur with > zswap too, so I'd like to learn more :) Right, this is likely something that could be addressed more generally for zswap and zram. Thanks Barry
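As an illustration of the timing model in the message above, here is a minimal Python sketch. The function names and the sample values for t1, t2 and t3 are hypothetical, and the model ignores re-scan overhead and scheduling latency, so it should be read as arithmetic rather than as a measurement.

```python
def reclaim_time_sequential(t1, t2, t3):
    """kswapd does everything itself: scan/unmap, compress, reclaim file folios."""
    return t1 + t2 + t3


def reclaim_time_overlapped(t1, t2, t3):
    """Compression (t2) runs in another thread while kswapd does t1 + t3."""
    return max(t1 + t3, t2)


# Hypothetical durations in arbitrary time units (not taken from the thread).
t1, t2, t3 = 3.0, 7.0, 2.0

sequential = reclaim_time_sequential(t1, t2, t3)   # 3 + 7 + 2 = 12.0
overlapped = reclaim_time_overlapped(t1, t2, t3)   # max(5, 7) = 7.0

print(f"sequential={sequential} overlapped={overlapped}")
```

With these made-up numbers compression stays the longer stage, so the overlapped time is bounded by t2; once t1 + t3 exceeds t2, offloading compression further would no longer shorten the total in this model.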
On Sat, 2025-03-08 at 08:34 +1300, Barry Song wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > > > On Sat, Mar 8, 2025 at 1:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> > wrote: > > > > This patch series introduces a new mechanism called kcompressd to > > improve the efficiency of memory reclaiming in the operating > > system. The > > main goal is to separate the tasks of page scanning and page > > compression > > into distinct processes or threads, thereby reducing the load on > > the > > kswapd thread and enhancing overall system performance under high > > memory > > pressure conditions. > > > > Problem: > > In the current system, the kswapd thread is responsible for both > > scanning the LRU pages and compressing pages into the ZRAM. This > > combined responsibility can lead to significant performance > > bottlenecks, > > especially under high memory pressure. The kswapd thread becomes a > > single point of contention, causing delays in memory reclaiming > > and > > overall system performance degradation. > > > > Target: > > The target of this invention is to improve the efficiency of > > memory > > reclaiming. By separating the tasks of page scanning and page > > compression into distinct processes or threads, the system can > > handle > > memory pressure more effectively. > > Sounds great. However, we also have a time window where folios under > writeback are kept, whereas previously, writeback was done > synchronously > without your patch. This may temporarily increase memory usage until > the > kept folios are re-scanned. > > So, you’ve observed that folio_rotate_reclaimable() runs shortly > while the > async thread completes compression? Then the kept folios are shortly > re-scanned? > Yes, these folios may need to be re-scanned, so folio_rotate_reclaimable() will be run. This can be observed from the increase in pgrotated in /proc/vmstat. 
> > > > Patch 1: > > - Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and > > SWP_READ_SYNCHRONOUS_IO. > > > > Patch 2: > > - Implemented the core functionality of Kcompressd and made > > necessary > > modifications to the zram driver to support it. > > > > In our handheld devices, we found that applying this mechanism > > under high > > memory pressure scenarios can increase the rate of pgsteal_anon per > > second > > by over 260% compared to the situation with only kswapd. > > Sounds really great. > > What compression algorithm is being used? I assume that after > switching to a > different compression algorithm, the benefits will change > significantly. For > example, Zstd might not show as much improvement. > How was the CPU usage ratio between page scan/unmap and compression > observed before applying this patch? > The original tests were based on LZ4. We observed that the ratio of CPU time spent scanning the LRU to CPU time spent compressing folios is approximately 3:7. We also tried ZSTD as the zram backend, but the number of anonymous folios reclaimed per second did not differ significantly from LZ4 (the effect of switching algorithms was far smaller than what could be achieved with parallel processing). Even with ZSTD, we were still able to reach around 800,000 pgsteal_anon per second using kcompressd. 
> > > > Qun-Wei Lin (2): > > mm: Split BLK_FEAT_SYNCHRONOUS and SWP_SYNCHRONOUS_IO into > > separate > > read and write flags > > kcompressd: Add Kcompressd for accelerated zram compression > > > > drivers/block/brd.c | 3 +- > > drivers/block/zram/Kconfig | 11 ++ > > drivers/block/zram/Makefile | 3 +- > > drivers/block/zram/kcompressd.c | 340 > > ++++++++++++++++++++++++++++++++ > > drivers/block/zram/kcompressd.h | 25 +++ > > drivers/block/zram/zram_drv.c | 21 +- > > drivers/nvdimm/btt.c | 3 +- > > drivers/nvdimm/pmem.c | 5 +- > > include/linux/blkdev.h | 24 ++- > > include/linux/swap.h | 31 +-- > > mm/memory.c | 4 +- > > mm/page_io.c | 6 +- > > mm/swapfile.c | 7 +- > > 13 files changed, 446 insertions(+), 37 deletions(-) > > create mode 100644 drivers/block/zram/kcompressd.c > > create mode 100644 drivers/block/zram/kcompressd.h > > > > -- > > 2.45.2 > > > > Thanks > Barry Best Regards, Qun-wei
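The 3:7 scan-to-compression ratio reported in the message above gives a rough upper bound on what offloading can buy. The sketch below is an idealized model and not a description of the patch: it assumes compression scales perfectly across `n_compressors` threads (a hypothetical parameter), that idle CPUs are available for them, that file reclaim is negligible, and that re-scanning kept folios costs nothing.

```python
def ideal_speedup(scan_share, compress_share, n_compressors):
    """Upper bound on reclaim speedup when compression is offloaded.

    scan_share / compress_share: relative CPU time of the two stages.
    n_compressors: hypothetical number of compression threads.
    """
    baseline = scan_share + compress_share                        # one thread does both
    overlapped = max(scan_share, compress_share / n_compressors)  # slower stage dominates
    return baseline / overlapped


for n in (1, 2, 3):
    print(n, round(ideal_speedup(3, 7, n), 2))
# prints:
# 1 1.43
# 2 2.86
# 3 3.33
```

In this model the bound stops growing once compression is no longer the slower stage: from three compressor threads onward the scan stage (3 units) dominates and the ideal speedup stays at 10/3.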
On Sat, 2025-03-08 at 18:41 +1300, Barry Song wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > > > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin > > <qun-wei.lin@mediatek.com> wrote: > > > > > > This patch series introduces a new mechanism called kcompressd to > > > improve the efficiency of memory reclaiming in the operating > > > system. The > > > main goal is to separate the tasks of page scanning and page > > > compression > > > into distinct processes or threads, thereby reducing the load on > > > the > > > kswapd thread and enhancing overall system performance under high > > > memory > > > pressure conditions. > > > > Please excuse my ignorance, but from your cover letter I still > > don't > > quite get what is the problem here? And how would decouple > > compression > > and scanning help? > > My understanding is as follows: > > When kswapd attempts to reclaim M anonymous folios and N file folios, > the process involves the following steps: > > * t1: Time to scan and unmap anonymous folios > * t2: Time to compress anonymous folios > * t3: Time to reclaim file folios > > Currently, these steps are executed sequentially, meaning the total > time > required to reclaim M + N folios is t1 + t2 + t3. > > However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel, > reducing the total time to max(t1 + t3, t2). This likely improves the > reclamation speed, potentially reducing allocation stalls. > > I don’t have concrete data on this. Does Qun-Wei have detailed > performance data? > Thank you for your explanation. Compared to the original single kswapd, we expect t1 to have a slight increase in re-scan time. However, since our kcompressd can focus on compression tasks and we can have multiple kcompressd instances (kcompressd0, kcompressd1, ...) 
running in parallel, we anticipate that a folio will not need to be re-scanned too many times. In our experiments, we fixed the CPU and DRAM at a certain frequency. We created a high memory pressure environment using a memory eater and recorded the increase in pgsteal_anon per second, which was around 300,000. Then we applied our patch and measured again; pgsteal_anon/s increased to over 800,000. > > > > > > > > Problem: > > > In the current system, the kswapd thread is responsible for both > > > scanning the LRU pages and compressing pages into the ZRAM. This > > > combined responsibility can lead to significant performance > > > bottlenecks, > > > > What bottleneck are we talking about? Is one stage slower than the > > other? > > > > > especially under high memory pressure. The kswapd thread becomes > > > a > > > single point of contention, causing delays in memory reclaiming > > > and > > > overall system performance degradation. > > > > > > Target: > > > The target of this invention is to improve the efficiency of > > > memory > > > reclaiming. By separating the tasks of page scanning and page > > > compression into distinct processes or threads, the system can > > > handle > > > memory pressure more effectively. > > > > I'm not a zram maintainer, so I'm definitely not trying to stop > > this > > patch. But whatever problem zram is facing will likely occur with > > zswap too, so I'd like to learn more :) > > Right, this is likely something that could be addressed more > generally > for zswap and zram. > Yes, we also hope to extend this to other swap devices, but currently, we have only modified zram. We are not very familiar with zswap and would like to ask if anyone has any suggestions for modifications? > Thanks > Barry Best Regards, Qun-wei
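For readers who want to reproduce this kind of measurement, the per-second rates of pgsteal_anon and pgrotated can be derived from two snapshots of /proc/vmstat. The following is a minimal sketch: the helper names are hypothetical and the two snapshot strings are made-up sample data, not numbers from the experiments above.

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style text ('name value' per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rate_per_second(before, after, key, interval_s):
    """Counter delta between two snapshots divided by the sampling interval."""
    return (after[key] - before[key]) / interval_s


# Made-up snapshots taken 2 seconds apart (illustrative values only).
SNAPSHOT_T0 = "pgsteal_anon 1000\npgrotated 10\n"
SNAPSHOT_T1 = "pgsteal_anon 4000\npgrotated 50\n"

t0 = parse_vmstat(SNAPSHOT_T0)
t1 = parse_vmstat(SNAPSHOT_T1)

steal_rate = rate_per_second(t0, t1, "pgsteal_anon", 2.0)   # (4000 - 1000) / 2
rotate_rate = rate_per_second(t0, t1, "pgrotated", 2.0)     # (50 - 10) / 2
print(steal_rate, rotate_rate)
```

On a live system the snapshot text would come from reading /proc/vmstat twice, for example with `open("/proc/vmstat").read()`, separated by a sleep of the chosen interval.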
On Mon, Mar 10, 2025 at 6:22 AM Qun-wei Lin (林群崴) <Qun-wei.Lin@mediatek.com> wrote: > > > Thank you for your explanation. Compared to the original single kswapd, > we expect t1 to have a slight increase in re-scan time. However, since > our kcompressd can focus on compression tasks and we can have multiple > kcompressd instances (kcompressd0, kcompressd1, ...) running in > parallel, we anticipate that the number of times a folio needs be re- > scanned will not be too many. > > In our experiments, we fixed the CPU and DRAM at a certain frequency. > We created a high memory pressure enviroment using a memory eater and > recorded the increase in pgsteal_anon per second, which was around 300, > 000. Then we applied our patch and measured again, that pgsteal_anon/s > increased to over 800,000. > > > > > > > > > > > > Problem: > > > > In the current system, the kswapd thread is responsible for both > > > > scanning the LRU pages and compressing pages into the ZRAM. This > > > > combined responsibility can lead to significant performance > > > > bottlenecks, > > > > > > What bottleneck are we talking about? Is one stage slower than the > > > other? > > > > > > > especially under high memory pressure. The kswapd thread becomes > > > > a > > > > single point of contention, causing delays in memory reclaiming > > > > and > > > > overall system performance degradation. > > > > > > > > Target: > > > > The target of this invention is to improve the efficiency of > > > > memory > > > > reclaiming. By separating the tasks of page scanning and page > > > > compression into distinct processes or threads, the system can > > > > handle > > > > memory pressure more effectively. > > > > > > I'm not a zram maintainer, so I'm definitely not trying to stop > > > this > > > patch. But whatever problem zram is facing will likely occur with > > > zswap too, so I'd like to learn more :) > > > > Right, this is likely something that could be addressed more > > generally > > for zswap and zram. 
> > > > Yes, we also hope to extend this to other swap devices, but currently, > we have only modified zram. We are not very familiar with zswap and > would like to ask if anyone has any suggestions for modifications? > My understanding is that right now schedule_bio_write is the work submission API, right? We can make it generic by having it accept a callback and a generic untyped pointer, which can be cast into a backend-specific context struct. For zram it would contain struct zram and the bio. For zswap, it depends on the point at which you want to begin offloading the work: it could simply be the folio itself if we offload early, or a more complicated scheme. > > Thanks > > Barry > > Best Regards, > Qun-wei > >
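The generic submission interface suggested above (a callback plus an opaque, backend-specific context) can be sketched in user space as follows. This is only a toy model in Python: `CompressWorker`, `submit`, the two callbacks and the context dictionaries are hypothetical names, and a Python object stands in for the untyped pointer that a C implementation would cast to its own context struct.

```python
import queue
import threading


class CompressWorker:
    """Toy stand-in for a kcompressd-like thread (hypothetical, user space)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, callback, ctx):
        """Queue work: the worker only knows 'call callback(ctx)'."""
        self._queue.put((callback, ctx))

    def _run(self):
        while True:
            callback, ctx = self._queue.get()
            try:
                callback(ctx)          # backend-specific work happens here
            finally:
                self._queue.task_done()

    def drain(self):
        """Block until every queued item has been processed."""
        self._queue.join()


results = []


def zram_write(ctx):
    # For zram the context would carry the device and the bio.
    results.append(("zram", ctx["dev"], ctx["bio"]))


def zswap_store_cb(ctx):
    # For zswap, offloading early, the context could be just the folio.
    results.append(("zswap", ctx["folio"]))


worker = CompressWorker()
worker.submit(zram_write, {"dev": "zram0", "bio": "bio#1"})
worker.submit(zswap_store_cb, {"folio": "folio#7"})
worker.drain()
print(results)
```

The point of the design is that the worker never inspects the context, so zram and zswap can share one queue and one set of threads while keeping their own context layouts.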
On Mon, Mar 10, 2025 at 9:58 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 6:22 AM Qun-wei Lin (林群崴) > <Qun-wei.Lin@mediatek.com> wrote: > > > > > > Thank you for your explanation. Compared to the original single kswapd, > > we expect t1 to have a slight increase in re-scan time. However, since > > our kcompressd can focus on compression tasks and we can have multiple > > kcompressd instances (kcompressd0, kcompressd1, ...) running in > > parallel, we anticipate that the number of times a folio needs be re- > > scanned will not be too many. > > > > In our experiments, we fixed the CPU and DRAM at a certain frequency. > > We created a high memory pressure enviroment using a memory eater and > > recorded the increase in pgsteal_anon per second, which was around 300, > > 000. Then we applied our patch and measured again, that pgsteal_anon/s > > increased to over 800,000. > > > > > > > > > > > > > > > > Problem: > > > > > In the current system, the kswapd thread is responsible for both > > > > > scanning the LRU pages and compressing pages into the ZRAM. This > > > > > combined responsibility can lead to significant performance > > > > > bottlenecks, > > > > > > > > What bottleneck are we talking about? Is one stage slower than the > > > > other? > > > > > > > > > especially under high memory pressure. The kswapd thread becomes > > > > > a > > > > > single point of contention, causing delays in memory reclaiming > > > > > and > > > > > overall system performance degradation. > > > > > > > > > > Target: > > > > > The target of this invention is to improve the efficiency of > > > > > memory > > > > > reclaiming. By separating the tasks of page scanning and page > > > > > compression into distinct processes or threads, the system can > > > > > handle > > > > > memory pressure more effectively. > > > > > > > > I'm not a zram maintainer, so I'm definitely not trying to stop > > > > this > > > > patch. 
But whatever problem zram is facing will likely occur with > > > > zswap too, so I'd like to learn more :) > > > > > > Right, this is likely something that could be addressed more > > > generally > > > for zswap and zram. > > > > > > > Yes, we also hope to extend this to other swap devices, but currently, > > we have only modified zram. We are not very familiar with zswap and > > would like to ask if anyone has any suggestions for modifications? > > > > My understanding is right now schedule_bio_write is the work > submission API right? We can make it generic, having it accept a > callback and a generic untyped pointer which can be casted into a > backend-specific context struct. For zram it would contain struct zram > and the bio. For zswap, depending on at which point do you want to > begin offloading the work - it could simply be just the folio itself > if we offload early, or a more complicated scheme. To expand a bit - zswap_store() is where all the logic lives. It's fairly straightforward: check zswap cgroup limits, acquire the zswap pool (a combination of compression algorithm and backend memory allocator, which is just zsmalloc now), perform compression, then ask zsmalloc for a slot and store the result there. You can probably just offload the whole thing here, or perform some steps of the sequence before offloading the rest :) One slight complication: don't forget to fall back to disk swapping. Unlike zram, zswap was originally designed as a "cache" for underlying swap files on disk, which we can fall back to if the compression attempt fails. Everything should be fairly straightforward though :) > > > > > > Thanks > > > Barry > > > > Best Regards, > > Qun-wei > > > >
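The store sequence and the fallback described above can be mimicked with a small toy model. This is not the kernel's zswap_store(): `ToyZswap`, its byte limit and its two dictionaries are hypothetical, zlib stands in for the kernel compressor, and treating "output not smaller than a page" as a failed compression attempt is an assumption made for the sketch.

```python
import os
import zlib

PAGE_SIZE = 4096


class ToyZswap:
    """Toy model: limit check, compress, store, else fall back to 'disk'."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self.pool = {}   # swap entry -> compressed bytes (stands in for zsmalloc)
        self.disk = {}   # swap entry -> raw page (stands in for the swap file)

    def _fallback(self, entry, page):
        self.disk[entry] = page
        return "disk"

    def store(self, entry, page):
        # Step 1: limit check (a single byte budget replaces cgroup limits).
        if self.used_bytes >= self.limit_bytes:
            return self._fallback(entry, page)
        # Step 2: compression.
        compressed = zlib.compress(page)
        # Assumption: no saving means the attempt failed, so use the disk path.
        if len(compressed) >= PAGE_SIZE:
            return self._fallback(entry, page)
        # Step 3: take a slot and store the compressed data.
        self.pool[entry] = compressed
        self.used_bytes += len(compressed)
        return "zswap"


cache = ToyZswap(limit_bytes=1 << 20)
print(cache.store(1, bytes(PAGE_SIZE)))       # zero page compresses well
print(cache.store(2, os.urandom(PAGE_SIZE)))  # random data does not compress
```

An asynchronous variant would have to run this whole sequence, including the fallback decision, in the worker thread, or run the cheap limit check first and offload only the compression and store steps.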
On (25/03/08 18:41), Barry Song wrote: > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote: > > > > > > This patch series introduces a new mechanism called kcompressd to > > > improve the efficiency of memory reclaiming in the operating system. The > > > main goal is to separate the tasks of page scanning and page compression > > > into distinct processes or threads, thereby reducing the load on the > > > kswapd thread and enhancing overall system performance under high memory > > > pressure conditions. > > > > Please excuse my ignorance, but from your cover letter I still don't > > quite get what is the problem here? And how would decouple compression > > and scanning help? > > My understanding is as follows: > > When kswapd attempts to reclaim M anonymous folios and N file folios, > the process involves the following steps: > > * t1: Time to scan and unmap anonymous folios > * t2: Time to compress anonymous folios > * t3: Time to reclaim file folios > > Currently, these steps are executed sequentially, meaning the total time > required to reclaim M + N folios is t1 + t2 + t3. > > However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel, > reducing the total time to max(t1 + t3, t2). This likely improves the > reclamation speed, potentially reducing allocation stalls. If compression kthreads can run (have CPUs to be scheduled on). This looks a bit like a bottleneck. Is there anything that guarantees forward progress? Also, if compression kthreads constantly preempt kswapd, then it might not be worth it to have compression kthreads, I assume? If we have a pagefault and need to map a page that is still in the compression queue (not compressed and stored in zram yet, e.g. due to scheduling latency + slow compression algorithm) then what happens?
On Tue, Mar 11, 2025 at 5:58 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote: > > On (25/03/08 18:41), Barry Song wrote: > > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote: > > > > > > > > This patch series introduces a new mechanism called kcompressd to > > > > improve the efficiency of memory reclaiming in the operating system. The > > > > main goal is to separate the tasks of page scanning and page compression > > > > into distinct processes or threads, thereby reducing the load on the > > > > kswapd thread and enhancing overall system performance under high memory > > > > pressure conditions. > > > > > > Please excuse my ignorance, but from your cover letter I still don't > > > quite get what is the problem here? And how would decouple compression > > > and scanning help? > > > > My understanding is as follows: > > > > When kswapd attempts to reclaim M anonymous folios and N file folios, > > the process involves the following steps: > > > > * t1: Time to scan and unmap anonymous folios > > * t2: Time to compress anonymous folios > > * t3: Time to reclaim file folios > > > > Currently, these steps are executed sequentially, meaning the total time > > required to reclaim M + N folios is t1 + t2 + t3. > > > > However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel, > > reducing the total time to max(t1 + t3, t2). This likely improves the > > reclamation speed, potentially reducing allocation stalls. > > If compression kthread-s can run (have CPUs to be scheduled on). > This looks a bit like a bottleneck. Is there anything that > guarantees forward progress? Also, if compression kthreads > constantly preempt kswapd, then it might not be worth it to > have compression kthreads, I assume? Thanks for your critical insights, all of which are valuable. 
Qun-Wei is likely working on an Android case where the CPU is relatively idle in many scenarios (though there are certainly cases where all CPUs are busy), but free memory is quite limited. We may soon see benefits for these types of use cases. I expect Android might have the opportunity to adopt it before it's fully ready upstream. If the workload keeps all CPUs busy, I suppose this async thread won’t help, but at least we might find a way to mitigate the regression. We likely need to collect more data on various scenarios (when CPUs are relatively idle and when all CPUs are busy) and determine the proper approach based on the data, which we currently lack :-) > > If we have a pagefault and need to map a page that is still in > the compression queue (not compressed and stored in zram yet, e.g. > due to scheduling latency + slow compression algorithm) then what > happens? Doesn't this already happen even without the patch? Right now we have 4 steps: 1. add_to_swap: The folio is added to the swapcache. 2. try_to_unmap: PTEs are converted to swap entries. 3. pageout: The folio is written back. 4. Swapcache is cleared. If a swap-in occurs between 2 and 4, doesn't that mean we've already encountered the case where we hit the swapcache for a folio undergoing compression? It seems we might have an opportunity to terminate compression if the request is still in the queue and compression hasn’t started for a folio yet, but that seems quite difficult to do? Thanks Barry
On Tue, 2025-03-11 at 22:33 +1300, Barry Song wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > > > On Tue, Mar 11, 2025 at 5:58 PM Sergey Senozhatsky > <senozhatsky@chromium.org> wrote: > > > > On (25/03/08 18:41), Barry Song wrote: > > > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> > > > wrote: > > > > > > > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin > > > > <qun-wei.lin@mediatek.com> wrote: > > > > > > > > > > This patch series introduces a new mechanism called > > > > > kcompressd to > > > > > improve the efficiency of memory reclaiming in the operating > > > > > system. The > > > > > main goal is to separate the tasks of page scanning and page > > > > > compression > > > > > into distinct processes or threads, thereby reducing the load > > > > > on the > > > > > kswapd thread and enhancing overall system performance under > > > > > high memory > > > > > pressure conditions. > > > > > > > > Please excuse my ignorance, but from your cover letter I still > > > > don't > > > > quite get what is the problem here? And how would decouple > > > > compression > > > > and scanning help? > > > > > > My understanding is as follows: > > > > > > When kswapd attempts to reclaim M anonymous folios and N file > > > folios, > > > the process involves the following steps: > > > > > > * t1: Time to scan and unmap anonymous folios > > > * t2: Time to compress anonymous folios > > > * t3: Time to reclaim file folios > > > > > > Currently, these steps are executed sequentially, meaning the > > > total time > > > required to reclaim M + N folios is t1 + t2 + t3. > > > > > > However, Qun-Wei's patch enables t1 + t3 and t2 to run in > > > parallel, > > > reducing the total time to max(t1 + t3, t2). This likely improves > > > the > > > reclamation speed, potentially reducing allocation stalls. > > > > If compression kthread-s can run (have CPUs to be scheduled on). 
> > This looks a bit like a bottleneck. Is there anything that > > guarantees forward progress? Also, if compression kthreads > > constantly preempt kswapd, then it might not be worth it to > > have compression kthreads, I assume? > > Thanks for your critical insights, all of which are valuable. > > Qun-Wei is likely working on an Android case where the CPU is > relatively idle in many scenarios (though there are certainly cases > where all CPUs are busy), but free memory is quite limited. > We may soon see benefits for these types of use cases. I expect > Android might have the opportunity to adopt it before it's fully > ready upstream. > > If the workload keeps all CPUs busy, I suppose this async thread > won’t help, but at least we might find a way to mitigate regression. > > We likely need to collect more data on various scenarios (when > CPUs are relatively idle and when all CPUs are busy) and > determine the proper approach based on the data, which we > currently lack :-) > Thanks for the explanation! > > > > If we have a pagefault and need to map a page that is still in > > the compression queue (not compressed and stored in zram yet, e.g. > > due to scheduling latency + slow compression algorithm) then what > > happens? > > This is happening now even without the patch? Right now we are > having 4 steps: > 1. add_to_swap: The folio is added to the swapcache. > 2. try_to_unmap: PTEs are converted to swap entries. > 3. pageout: The folio is written back. > 4. Swapcache is cleared. > > If a swap-in occurs between 2 and 4, doesn't that mean > we've already encountered the case where we hit > the swapcache for a folio undergoing compression? > > It seems we might have an opportunity to terminate > compression if the request is still in the queue and > compression hasn’t started for a folio yet? seems > quite difficult to do? As Barry explained, these folios that are being compressed are in the swapcache. 
If a refault occurs during compression, correctness is already guaranteed by the swap subsystem (as with other asynchronous swap devices). Indeed, cancelling the compression of a folio that is already waiting in the queue is a challenging task. Would this require modifications to the current architecture of the swap subsystem? > > Thanks > Barry Best Regards, Qun-wei
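The argument that a refault during compression is served from the swapcache can be illustrated with a toy model of the four steps listed in this exchange. Everything here is hypothetical: `ToyReclaim` and its dictionaries are simplifications, and details such as locking, the writeback flag and what happens to the queued request after a swapcache hit are deliberately left out.

```python
class ToyReclaim:
    """Toy model of the 4 reclaim steps with asynchronous writeback."""

    def __init__(self):
        self.swapcache = {}   # swap entry -> folio data, present until step 4
        self.backend = {}     # swap entry -> data written back (e.g. to zram)
        self.pending = []     # entries queued for the compression thread

    def start_reclaim(self, entry, folio):
        self.swapcache[entry] = folio   # step 1: add_to_swap
        # step 2: try_to_unmap would turn the PTEs into swap entries here
        self.pending.append(entry)      # step 3 is deferred to the worker

    def complete_writeback(self):
        entry = self.pending.pop(0)
        self.backend[entry] = self.swapcache[entry]   # step 3: pageout
        del self.swapcache[entry]                     # step 4: swapcache cleared

    def swap_in(self, entry):
        # A fault between steps 2 and 4 finds the folio in the swapcache.
        if entry in self.swapcache:
            return ("swapcache", self.swapcache[entry])
        return ("backend", self.backend[entry])


mm = ToyReclaim()
mm.start_reclaim(42, "folio-data")
during = mm.swap_in(42)      # compression has not run yet
mm.complete_writeback()
after = mm.swap_in(42)       # now served from the backend
print(during, after)
```

The sketch also shows why cancelling queued work is awkward: the refault path only consults the swapcache and has no handle on the entry sitting in `pending`.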
On (25/03/11 14:12), Qun-wei Lin (林群崴) wrote: > > > If compression kthread-s can run (have CPUs to be scheduled on). > > > This looks a bit like a bottleneck. Is there anything that > > > guarantees forward progress? Also, if compression kthreads > > > constantly preempt kswapd, then it might not be worth it to > > > have compression kthreads, I assume? > > > > Thanks for your critical insights, all of which are valuable. > > > > Qun-Wei is likely working on an Android case where the CPU is > > relatively idle in many scenarios (though there are certainly cases > > where all CPUs are busy), but free memory is quite limited. > > We may soon see benefits for these types of use cases. I expect > > Android might have the opportunity to adopt it before it's fully > > ready upstream. > > > > If the workload keeps all CPUs busy, I suppose this async thread > > won’t help, but at least we might find a way to mitigate regression. > > > > We likely need to collect more data on various scenarios (when > > CPUs are relatively idle and when all CPUs are busy) and > > determine the proper approach based on the data, which we > > currently lack :-) Right. The scan/unmap can move very fast (a rabbit) while the compressor can move rather slowly (a tortoise). There is some benefit in the fact that kswapd does compression directly, I'd presume. Another thing to consider, perhaps, is that not every page is necessarily required to go through the compressor queue and stay there until the woken-up compressor finally picks it up just to realize that the page is filled with 0xff (or any other pattern). At least on the zram side such pages are not compressed; they are stored as an 8-byte pattern in the zram meta table (without using any zsmalloc memory). > > > If we have a pagefault and need to map a page that is still in > > > the compression queue (not compressed and stored in zram yet, e.g. > > > due to scheduling latency + slow compression algorithm) then what > > > happens? 
> > > > This is happening now even without the patch? Right now we are > > having 4 steps: > > 1. add_to_swap: The folio is added to the swapcache. > > 2. try_to_unmap: PTEs are converted to swap entries. > > 3. pageout: The folio is written back. > > 4. Swapcache is cleared. > > > > If a swap-in occurs between 2 and 4, doesn't that mean > > we've already encountered the case where we hit > > the swapcache for a folio undergoing compression? > > > > It seems we might have an opportunity to terminate > > compression if the request is still in the queue and > > compression hasn’t started for a folio yet? seems > > quite difficult to do? > > As Barry explained, these folios that are being compressed are in the > swapcache. If a refault occurs during the compression process, its > correctness is already guaranteed by the swap subsystem (similar to > other asynchronous swap devices). Right. I was just thinking that now there is a wake_up between scan/unmap and compress. Not sure how much trouble this can cause. > Indeed, terminating a folio that is already in the queue waiting for > compression is a challenging task. Will this require some modifications > to the current architecture of swap subsystem? Yeah, I'll leave it to the mm folks to decide :)
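The same-filled page shortcut mentioned in this message (pages consisting of one repeated 8-byte pattern need no compression) could be checked before a page is ever queued. The sketch below is a user-space illustration only: the function names are hypothetical, and a 4096-byte page with an 8-byte word is assumed.

```python
PAGE_SIZE = 4096   # assumed page size
WORD_SIZE = 8      # assumed size of the repeated pattern (one machine word)


def same_filled_word(page):
    """Return the repeated 8-byte word if the page is same-filled, else None."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one page of data")
    word = page[:WORD_SIZE]
    if page == word * (PAGE_SIZE // WORD_SIZE):
        return word
    return None


def needs_compression(page):
    """Only pages that are not same-filled are worth queueing for compression."""
    return same_filled_word(page) is None


ones = b"\xff" * PAGE_SIZE
mixed = b"\xff" * (PAGE_SIZE - 1) + b"\x00"

print(needs_compression(ones))    # False: only the 8-byte pattern must be kept
print(needs_compression(mixed))   # True: the last byte breaks the pattern
```

Running such a check in the submitting context would keep same-filled pages out of the compression queue entirely, at the price of one extra pass over the page before the hand-off.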