Message ID | 20231024142706.195517-1-hezhongkun.hzk@bytedance.com
---|---
State | New
Series | mm: zswap: fix the lack of page lru flag in zswap_writeback_entry
On Tue, 24 Oct 2023 22:27:06 +0800 Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:

> The zswap_writeback_entry() will add a page to the swap cache, decompress
> the entry data into the page, and issue a bio write to write the page back
> to the swap device. Move the page to the tail of lru list through
> SetPageReclaim(page) and folio_rotate_reclaimable().
>
> Currently, about half of the pages will fail to move to the tail of lru
> list because there is no LRU flag in page which is not in the LRU list but
> the cpu_fbatches. So fix it.

Some reviewer input on this change would be helpful, please.

Also, is there any measurable change in overall performance?

> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1139,6 +1139,11 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	/* move it to the tail of the inactive list after end_writeback */
>  	SetPageReclaim(page);
>
> +	if (!PageLRU(page)) {
> +		/* drain lru cache to help folio_rotate_reclaimable() */
> +		lru_add_drain();
> +	}
> +
>  	/* start writeback */
>  	__swap_writepage(page, &wbc);
>  	put_page(page);
> --
> 2.25.1
>
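The "cpu_fbatches" the changelog refers to are the per-CPU folio batches in mm/swap.c that buffer folios on their way to the LRU lists; a folio parked there does not yet have PG_lru set. A simplified sketch of the relevant definitions (based on mm/swap.c around v6.6; field names and members vary between kernel versions):

	/*
	 * Simplified sketch of the per-CPU LRU batching in mm/swap.c
	 * (approximately v6.6; exact members differ between versions).
	 * Newly allocated folios wait in cpu_fbatches.lru_add and only get
	 * PG_lru set when that batch is drained, which is why the patch
	 * above can see !PageLRU(page) for a freshly read-in swap cache page.
	 */
	struct cpu_fbatches {
		local_lock_t lock;
		struct folio_batch lru_add;	/* new folios, not yet on an LRU list */
		struct folio_batch lru_deactivate_file;
		struct folio_batch lru_deactivate;
		struct folio_batch lru_lazyfree;
		/* plus an "activate" batch on SMP builds */
	};
	static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	/* folio_rotate_reclaimable() uses a separate, IRQ-safe rotate batch: */
	struct lru_rotate {
		local_lock_t lock;
		struct folio_batch fbatch;
	};
	static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};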
On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > My apologies for the delayed response. I have a couple of questions. > The zswap_writeback_entry() will add a page to the swap cache, decompress > the entry data into the page, and issue a bio write to write the page back > to the swap device. Move the page to the tail of lru list through > SetPageReclaim(page) and folio_rotate_reclaimable(). > > Currently, about half of the pages will fail to move to the tail of lru May I ask what's the downstream effect of this? i.e so what if it fails? And yes, as Andrew pointed out, it'd be nice if the patch changelog spells out any observable or measurable change from the user's POV. > list because there is no LRU flag in page which is not in the LRU list but > the cpu_fbatches. So fix it. This sentence is a bit confusing to me. Does this mean the page currently being processed for writeback is not in the LRU list (!PageLRU(page)), but IN one of the cpu folio batches? Which makes folio_rotate_reclaimable() fails on this page later on in the _swap_writepage() path? (hence the necessity of lru_add_drain()?) Let me know if I'm misunderstanding the intention of this patch. I know it's a bit pedantic, but spelling things out (ideally in the changelog itself) will help the reviewers, as well as future contributors who want to study the codebase and make changes to it. > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> Thanks and look forward to your response, Nhat P/S: Have a nice holiday season and happy new year!
On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: > > > > My apologies for the delayed response. I have a couple of questions. > > > The zswap_writeback_entry() will add a page to the swap cache, decompress > > the entry data into the page, and issue a bio write to write the page back > > to the swap device. Move the page to the tail of lru list through > > SetPageReclaim(page) and folio_rotate_reclaimable(). > > > > Currently, about half of the pages will fail to move to the tail of lru > > May I ask what's the downstream effect of this? i.e so what if it > fails? And yes, as Andrew pointed out, it'd be nice if the patch > changelog spells out any observable or measurable change from the > user's POV. > The swap cache page used to decompress zswap_entry should be moved to the tail of the inactive list after end_writeback, We can release them in time.Just like the following code in zswap_writeback_entry(). /* move it to the tail of the inactive list after end_writeback */ SetPageReclaim(page); After the writeback is over, the function of folio_rotate_reclaimable() will fail because the page is not in the LRU list but in some of the cpu folio batches. Therefore we did not achieve the goal of setting SetPageReclaim(page), and the pages could not be free in time. > > list because there is no LRU flag in page which is not in the LRU list but > > the cpu_fbatches. So fix it. > > This sentence is a bit confusing to me. Does this mean the page > currently being processed for writeback is not in the LRU list > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes > folio_rotate_reclaimable() fails on this page later on in the > _swap_writepage() path? (hence the necessity of lru_add_drain()?) > Yes, exactly. > Let me know if I'm misunderstanding the intention of this patch. I > know it's a bit pedantic, but spelling things out (ideally in the > changelog itself) will help the reviewers, as well as future > contributors who want to study the codebase and make changes to it. > Sorry,my bad. > > > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> > > Thanks and look forward to your response, > Nhat > > P/S: Have a nice holiday season and happy new year! Here are the steps and results of the performance test: 1:zswap+ zram (simplified model with on IO) 2:disabel zswap/parameters/same_filled_pages_enabled (stress have same pages) 3:stress --vm 1 --vm-bytes 2g --vm-hang 0 (2Gi anon pages) 4: In order to quickly release zswap_entry, I used the previous patch (I will send it again later). https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/ Performance result: reclaim 1Gi zswap_entry time echo 1 > writeback_time_threshold (will release the zswap_entry, not been accessed for more than 1 seconds ) Base With this patch real 0m1.015s real 0m1.043s user 0m0.000s user 0m0.001s sys 0m1.013s sys 0m1.040s So no obvious performance regression was found. After writeback, we perform the following steps to release the memory again echo 1g > memory.reclaim Base: total used recalim total used Mem: 38Gi 2.5Gi ----> 38Gi 1.5Gi Swap: 5.0Gi 1.0Gi ----> 5Gi 1.5Gi used memory -1G swap +0.5g It means that half of the pages are failed to move to the tail of lru list, So we need to release an additional 0.5Gi anon pages to swap space. 
With this patch: total used recalim total used Mem: 38Gi 2.6Gi ----> 38Gi 1.6Gi Swap: 5.0Gi 1.0Gi ----> 5Gi 1Gi used memory -1Gi, swap +0Gi It means that we release all the pages which have been add to the tail of lru list in zswap_writeback_entry() and folio_rotate_reclaimable(). Thanks for your time Nhat and Andrew. Happy New Year!
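The failure described above comes down to the LRU-flag check at the top of folio_rotate_reclaimable(); a simplified sketch of that function (mm/swap.c, around v6.6 - details may differ in other versions) shows why a folio still sitting in the per-CPU add batch is silently skipped:

	/*
	 * Simplified sketch of folio_rotate_reclaimable() (mm/swap.c, ~v6.6).
	 * The folio_test_lru() check is the one that fails for folios still
	 * parked in the per-CPU lru_add batch: they have PG_reclaim set by
	 * zswap_writeback_entry(), but not yet PG_lru, so the rotation to
	 * the tail of the inactive list is skipped.
	 */
	void folio_rotate_reclaimable(struct folio *folio)
	{
		if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
		    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
			struct folio_batch *fbatch;
			unsigned long flags;

			folio_get(folio);
			local_lock_irqsave(&lru_rotate.lock, flags);
			fbatch = this_cpu_ptr(&lru_rotate.fbatch);
			folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
			local_unlock_irqrestore(&lru_rotate.lock, flags);
		}
	}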
> Base:
>                  total   used      reclaim      total   used
>         Mem:     38Gi    2.5Gi     ---->        38Gi    1.5Gi
>         Swap:    5.0Gi   1.0Gi     ---->        5Gi     1.5Gi
>
> used memory -1G, swap +0.5G
> It means that half of the pages fail to move to the tail of the lru list,
> so we need to push an additional 0.5Gi of anon pages out to swap space.

Based on the results of multiple tests, the following information is updated.

Base:
                 total   used      reclaim      total   used
        Mem:     38Gi    2.6Gi     ---->        38Gi    1.6Gi
        Swap:    5.0Gi   1.0Gi     ---->        5Gi     1.9Gi

used memory -1G, swap +0.9G
It means that most of the pages fail to move to the tail of the lru list,
so we need to push an additional 0.9Gi of anon pages out to swap space.

Thanks again.
On Tue, Jan 2, 2024 at 3:39 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He > > <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > My apologies for the delayed response. I have a couple of questions. > > > > > The zswap_writeback_entry() will add a page to the swap cache, decompress > > > the entry data into the page, and issue a bio write to write the page back > > > to the swap device. Move the page to the tail of lru list through > > > SetPageReclaim(page) and folio_rotate_reclaimable(). > > > > > > Currently, about half of the pages will fail to move to the tail of lru > > > > May I ask what's the downstream effect of this? i.e so what if it > > fails? And yes, as Andrew pointed out, it'd be nice if the patch > > changelog spells out any observable or measurable change from the > > user's POV. > > > > The swap cache page used to decompress zswap_entry should be > moved to the tail of the inactive list after end_writeback, We can release > them in time.Just like the following code in zswap_writeback_entry(). > > /* move it to the tail of the inactive list after end_writeback */ > SetPageReclaim(page); > > After the writeback is over, the function of > folio_rotate_reclaimable() will fail > because the page is not in the LRU list but in some of the cpu folio batches. > Therefore we did not achieve the goal of setting SetPageReclaim(page), and > the pages could not be free in time. > > > > list because there is no LRU flag in page which is not in the LRU list but > > > the cpu_fbatches. So fix it. > > > > This sentence is a bit confusing to me. Does this mean the page > > currently being processed for writeback is not in the LRU list > > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes > > folio_rotate_reclaimable() fails on this page later on in the > > _swap_writepage() path? (hence the necessity of lru_add_drain()?) > > > > Yes, exactly. > > > Let me know if I'm misunderstanding the intention of this patch. I > > know it's a bit pedantic, but spelling things out (ideally in the > > changelog itself) will help the reviewers, as well as future > > contributors who want to study the codebase and make changes to it. > > > > Sorry,my bad. > > > > > > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> > > > > Thanks and look forward to your response, > > Nhat > > > > P/S: Have a nice holiday season and happy new year! > > Here are the steps and results of the performance test: > 1:zswap+ zram (simplified model with on IO) > 2:disabel zswap/parameters/same_filled_pages_enabled (stress have same pages) > 3:stress --vm 1 --vm-bytes 2g --vm-hang 0 (2Gi anon pages) > 4: In order to quickly release zswap_entry, I used the previous > patch (I will send it again later). > https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/ > > Performance result: > reclaim 1Gi zswap_entry > > time echo 1 > writeback_time_threshold > (will release the zswap_entry, not been accessed for more than 1 seconds ) > > Base With this patch > real 0m1.015s real 0m1.043s > user 0m0.000s user 0m0.001s > sys 0m1.013s sys 0m1.040s > So no obvious performance regression was found. That's around 2.7% increase in real time, no? Admittedly, this micro-benchmark is too small to conclude either way, but the data doesn't seem to be in your favor. 
I'm a bit concerned about the overhead here, given that with this patch we will drain the per-cpu batch on every written-back entry. That's quite a high frequency, especially since we're moving towards more writeback (either with the new zswap shrinker, or your time threshold-based writeback mechanism). For instance, there seems to be some (local/per-cpu) locking involved, no? Could there be some form of lock contentions there (especially since with the new shrinker, you can have a lot of concurrent writeback contexts?) Furthermore, note that a writeback from zswap to swap saves less memory than a writeback from memory to swap, so the effect of the extra overhead will be even more pronounced. That is, the amount of extra work done (from this change) to save one unit of memory would be even larger than if we call lru_add_drain() every time we swap out a page (from memory -> swap). And I'm pretty sure we don't call lru_add_drain() every time we swap out a page - I believe we call lru_add_drain() every time we perform a shrink action. For e.g, in shrink_inactive_list(). That's much coarser in granularity than here. Also, IIUC, the more often we perform lru_add_drain(), the less batching effect we will obtain. IOW, the overhead of maintaining the batch will become higher relative to the performance gains from batching. Maybe I'm missing something - could you walk me through how lru_add_drain() is fine here, from this POV? Thanks! > > After writeback, we perform the following steps to release the memory again > echo 1g > memory.reclaim > > Base: > total used recalim total used > Mem: 38Gi 2.5Gi ----> 38Gi 1.5Gi > Swap: 5.0Gi 1.0Gi ----> 5Gi 1.5Gi > used memory -1G swap +0.5g > It means that half of the pages are failed to move to the tail of lru list, > So we need to release an additional 0.5Gi anon pages to swap space. > > With this patch: > total used recalim total used > Mem: 38Gi 2.6Gi ----> 38Gi 1.6Gi > Swap: 5.0Gi 1.0Gi ----> 5Gi 1Gi > > used memory -1Gi, swap +0Gi > It means that we release all the pages which have been add to the tail of > lru list in zswap_writeback_entry() and folio_rotate_reclaimable(). > OTOH, this suggests that we're onto something. Swap usage seems to decrease quite a bit. Sounds like a real problem that this patch is tackling. (Please add this benchmark result to future changelog. It'll help demonstrate the problem). I'm inclined to ack this patch, but it'd be nice if you can assuage my concerns above (with some justification and/or larger benchmark). (Or perhaps, we have to drain, but less frequently/higher up the stack?) Thanks, Nhat > > Thanks for your time Nhat and Andrew. Happy New Year!
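For comparison, the coarser call site mentioned above looks roughly like this (a heavily trimmed sketch of shrink_inactive_list() in mm/vmscan.c; most of the function is elided): the drain happens once per isolated batch of folios rather than once per written-back page.

	/*
	 * Heavily trimmed sketch of shrink_inactive_list() (mm/vmscan.c).
	 * The point is only the granularity: lru_add_drain() runs once per
	 * batch of folios isolated for reclaim, not once per page.
	 */
	static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
			struct lruvec *lruvec, struct scan_control *sc,
			enum lru_list lru)
	{
		LIST_HEAD(folio_list);

		/* ... throttle if too many folios are already isolated ... */

		lru_add_drain();	/* one drain per reclaim batch */

		/*
		 * ... isolate up to nr_to_scan folios, run them through
		 * shrink_folio_list(), put back the survivors ...
		 */
		return 0;
	}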
> That's around 2.7% increase in real time, no? Admittedly, this
> micro-benchmark is too small to conclude either way, but the data
> doesn't seem to be in your favor.
>
> I'm a bit concerned about the overhead here, given that with this
> patch we will drain the per-cpu batch on every written-back entry.
> That's quite a high frequency, especially since we're moving towards
> more writeback (either with the new zswap shrinker, or your time
> threshold-based writeback mechanism). For instance, there seems to be
> some (local/per-cpu) locking involved, no? Could there be some form of
> lock contentions there (especially since with the new shrinker, you
> can have a lot of concurrent writeback contexts?)
>
> Furthermore, note that a writeback from zswap to swap saves less
> memory than a writeback from memory to swap, so the effect of the
> extra overhead will be even more pronounced. That is, the amount of
> extra work done (from this change) to save one unit of memory would be
> even larger than if we call lru_add_drain() every time we swap out a
> page (from memory -> swap). And I'm pretty sure we don't call
> lru_add_drain() every time we swap out a page - I believe we call
> lru_add_drain() every time we perform a shrink action. For e.g, in
> shrink_inactive_list(). That's much coarser in granularity than here.
>
> Also, IIUC, the more often we perform lru_add_drain(), the less
> batching effect we will obtain. IOW, the overhead of maintaining the
> batch will become higher relative to the performance gains from
> batching.
>
> Maybe I'm missing something - could you walk me through how
> lru_add_drain() is fine here, from this POV? Thanks!
>
> >
> > After writeback, we perform the following step to release the memory again:
> > echo 1g > memory.reclaim
> >
> > Base:
> >                  total   used      reclaim      total   used
> >         Mem:     38Gi    2.5Gi     ---->        38Gi    1.5Gi
> >         Swap:    5.0Gi   1.0Gi     ---->        5Gi     1.5Gi
> >
> > used memory -1G, swap +0.5G
> > It means that half of the pages fail to move to the tail of the lru list,
> > so we need to push an additional 0.5Gi of anon pages out to swap space.
> >
> > With this patch:
> >                  total   used      reclaim      total   used
> >         Mem:     38Gi    2.6Gi     ---->        38Gi    1.6Gi
> >         Swap:    5.0Gi   1.0Gi     ---->        5Gi     1Gi
> >
> > used memory -1Gi, swap +0Gi
> > It means that we release all the pages which were moved to the tail of the
> > lru list by zswap_writeback_entry() and folio_rotate_reclaimable().
>
> OTOH, this suggests that we're onto something. Swap usage seems to
> decrease quite a bit. Sounds like a real problem that this patch is
> tackling.
> (Please add this benchmark result to future changelog. It'll help
> demonstrate the problem).

Yes

>
> I'm inclined to ack this patch, but it'd be nice if you can assuage my
> concerns above (with some justification and/or larger benchmark).
>

OK, thanks.

> (Or perhaps, we have to drain, but less frequently/higher up the stack?)
>

I've reviewed the code again and do not have a better idea. It would be
great if you have any suggestions.

New test:
This patch adds the execution of folio_rotate_reclaimable() (not executed
without this patch) and lru_add_drain(), including percpu lock competition.
I bind a new task to allocate memory and use the same batch lock to compete
with the target process, on the same CPU.

context:
1: stress --vm 1 --vm-bytes 1g (bound to cpu0)
2: stress --vm 1 --vm-bytes 5g --vm-hang 0 (bound to cpu0)
3: reclaim pages, and write back 5G of zswap entries on cpu0 and node 0.

Average time of five tests:

          Base      patch     patch + compete
          4.947     5.0676    5.1336
                    +2.4%     +3.7%

compete means: a new stress run on cpu0 to compete with the writeback process.

  PID USER  %CPU %MEM    TIME+   COMMAND                    P
 1367 root  49.5  0.0  1:09.17   bash (writeback)           0
 1737 root  49.5  2.2  0:27.46   stress (use percpu lock)   0

around 2.4% increase in real time, including the execution of
folio_rotate_reclaimable() (not executed without this patch) and
lru_add_drain(), but no lock contention.

around 1.3% additional increase in real time with lock contention on the
same cpu.

There is another option here: do not move the page to the tail of the
inactive list after end_writeback, and delete the following code in
zswap_writeback_entry(), which does not work properly anyway. But then the
pages will not be released first.

	/* move it to the tail of the inactive list after end_writeback */
	SetPageReclaim(page);

Thanks,
Zhongkun

> Thanks,
> Nhat
>
> >
> > Thanks for your time Nhat and Andrew. Happy New Year!
On Wed, Jan 3, 2024 at 6:12 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > That's around 2.7% increase in real time, no? Admittedly, this > > micro-benchmark is too small to conclude either way, but the data > > doesn't seem to be in your favor. > > > > I'm a bit concerned about the overhead here, given that with this > > patch we will drain the per-cpu batch on every written-back entry. > > That's quite a high frequency, especially since we're moving towards > > more writeback (either with the new zswap shrinker, or your time > > threshold-based writeback mechanism). For instance, there seems to be > > some (local/per-cpu) locking involved, no? Could there be some form of > > lock contentions there (especially since with the new shrinker, you > > can have a lot of concurrent writeback contexts?) > > > > Furthermore, note that a writeback from zswap to swap saves less > > memory than a writeback from memory to swap, so the effect of the > > extra overhead will be even more pronounced. That is, the amount of > > extra work done (from this change) to save one unit of memory would be > > even larger than if we call lru_add_drain() every time we swap out a > > page (from memory -> swap). And I'm pretty sure we don't call > > lru_add_drain() every time we swap out a page - I believe we call > > lru_add_drain() every time we perform a shrink action. For e.g, in > > shrink_inactive_list(). That's much coarser in granularity than here. > > > > Also, IIUC, the more often we perform lru_add_drain(), the less > > batching effect we will obtain. IOW, the overhead of maintaining the > > batch will become higher relative to the performance gains from > > batching. > > > > Maybe I'm missing something - could you walk me through how > > lru_add_drain() is fine here, from this POV? Thanks! > > > > > > > > After writeback, we perform the following steps to release the memory again > > > echo 1g > memory.reclaim > > > > > > Base: > > > total used recalim total used > > > Mem: 38Gi 2.5Gi ----> 38Gi 1.5Gi > > > Swap: 5.0Gi 1.0Gi ----> 5Gi 1.5Gi > > > used memory -1G swap +0.5g > > > It means that half of the pages are failed to move to the tail of lru list, > > > So we need to release an additional 0.5Gi anon pages to swap space. > > > > > > With this patch: > > > total used recalim total used > > > Mem: 38Gi 2.6Gi ----> 38Gi 1.6Gi > > > Swap: 5.0Gi 1.0Gi ----> 5Gi 1Gi > > > > > > used memory -1Gi, swap +0Gi > > > It means that we release all the pages which have been add to the tail of > > > lru list in zswap_writeback_entry() and folio_rotate_reclaimable(). > > > > > > > OTOH, this suggests that we're onto something. Swap usage seems to > > decrease quite a bit. Sounds like a real problem that this patch is > > tackling. > > (Please add this benchmark result to future changelog. It'll help > > demonstrate the problem). > > Yes > > > > > I'm inclined to ack this patch, but it'd be nice if you can assuage my > > concerns above (with some justification and/or larger benchmark). > > > > OK,thanks. > > > (Or perhaps, we have to drain, but less frequently/higher up the stack?) > > > > I've reviewed the code again and have no idea. It would be better if > you have any suggestions. Hmm originally I was thinking of doing an (unconditional) lru_add_drain() outside of zswap_writeback_entry() - once in shrink_worker() and/or zswap_shrinker_scan(), before we write back any of the entries. Not sure if it would work/help here tho - haven't tested that idea yet. 
> > New test: > This patch will add the execution of folio_rotate_reclaimable(not executed > without this patch) and lru_add_drain,including percpu lock competition. > I bind a new task to allocate memory and use the same batch lock to compete > with the target process, on the same CPU. > context: > 1:stress --vm 1 --vm-bytes 1g (bind to cpu0) > 2:stress --vm 1 --vm-bytes 5g --vm-hang 0(bind to cpu0) > 3:reclaim pages, and writeback 5G zswap_entry in cpu0 and node 0. > > Average time of five tests > > Base patch patch + compete > 4.947 5.0676 5.1336 > +2.4% +3.7% > compete means: a new stress run in cpu0 to compete with the writeback process. > PID USER %CPU %MEM TIME+ COMMAND P > 1367 root 49.5 0.0 1:09.17 bash (writeback) 0 > 1737 root 49.5 2.2 0:27.46 stress (use percpu > lock) 0 > > around 2.4% increase in real time,including the execution of > folio_rotate_reclaimable(not executed without this patch) and lru_add_drain,but > no lock contentions. Hmm looks like the regression is still there, no? > > around 1.3% additional increase in real time with lock contentions on the same > cpu. > > There is another option here, which is not to move the page to the > tail of the inactive > list after end_writeback and delete the following code in > zswap_writeback_entry(), > which did not work properly. But the pages will not be released first. > > /* move it to the tail of the inactive list after end_writeback */ > SetPageReclaim(page); Or only SetPageReclaim on pages on LRU? > > Thanks, > Zhongkun > > > Thanks, > > Nhat > > > > > > > > Thanks for your time Nhat and Andrew. Happy New Year!
>
> Hmm originally I was thinking of doing an (unconditional)
> lru_add_drain() outside of zswap_writeback_entry() - once in
> shrink_worker() and/or zswap_shrinker_scan(), before we write back any
> of the entries. Not sure if it would work/help here tho - haven't
> tested that idea yet.
>

The pages are allocated by __read_swap_cache_async() in
zswap_writeback_entry(), and they must be newly allocated, not already
cached in the swap cache. Please see the code below in
zswap_writeback_entry():

	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
				       NO_INTERLEAVE_INDEX, &page_was_allocated);
	if (!page)
		goto fail;

	/* Found an existing page, we raced with load/swapin */
	if (!page_was_allocated) {
		put_page(page);
		ret = -EEXIST;
		goto fail;
	}

So by the time we reach SetPageReclaim(page), the page has just been
allocated and is still in the percpu batch; it has not been added to the
LRU yet. Therefore an lru_add_drain() outside of zswap_writeback_entry()
does not help.

> >
> > New test:
> > This patch adds the execution of folio_rotate_reclaimable() (not executed
> > without this patch) and lru_add_drain(), including percpu lock competition.
> > I bind a new task to allocate memory and use the same batch lock to compete
> > with the target process, on the same CPU.
> >
> > context:
> > 1: stress --vm 1 --vm-bytes 1g (bound to cpu0)
> > 2: stress --vm 1 --vm-bytes 5g --vm-hang 0 (bound to cpu0)
> > 3: reclaim pages, and write back 5G of zswap entries on cpu0 and node 0.
> >
> > Average time of five tests:
> >
> >           Base      patch     patch + compete
> >           4.947     5.0676    5.1336
> >                     +2.4%     +3.7%
> >
> > compete means: a new stress run on cpu0 to compete with the writeback process.
> >
> >   PID USER  %CPU %MEM    TIME+   COMMAND                    P
> >  1367 root  49.5  0.0  1:09.17   bash (writeback)           0
> >  1737 root  49.5  2.2  0:27.46   stress (use percpu lock)   0
> >
> > around 2.4% increase in real time, including the execution of
> > folio_rotate_reclaimable() (not executed without this patch) and
> > lru_add_drain(), but no lock contention.
>
> Hmm looks like the regression is still there, no?

Yes, it cannot be eliminated.

>
> > around 1.3% additional increase in real time with lock contention on the
> > same cpu.
> >
> > There is another option here: do not move the page to the tail of the
> > inactive list after end_writeback, and delete the following code in
> > zswap_writeback_entry(), which does not work properly anyway. But then
> > the pages will not be released first.
> >
> > /* move it to the tail of the inactive list after end_writeback */
> > SetPageReclaim(page);
>
> Or only SetPageReclaim on pages on LRU?

No, all the pages are newly allocated and not on the LRU.

This patch should just call lru_add_drain() directly and drop the if
statement. The purpose of writing the data back is to release the page, so
I think it is necessary to fix this.

Thanks for your time, Nhat.
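For context, here is a simplified sketch of the add path that __read_swap_cache_async() eventually takes for a newly allocated folio (folio_add_lru() in mm/swap.c, around v6.6; the lru_gen handling is omitted). The folio is only queued on this CPU's lru_add batch; PG_lru is set later, by lru_add_fn(), when the batch is drained, which is why PageLRU() is still false back in zswap_writeback_entry():

	/* Simplified sketch of folio_add_lru() (mm/swap.c, ~v6.6). */
	void folio_add_lru(struct folio *folio)
	{
		struct folio_batch *fbatch;

		folio_get(folio);
		local_lock(&cpu_fbatches.lock);
		fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
		/* queues the folio; lru_add_fn() runs when the batch is flushed */
		folio_batch_add_and_move(fbatch, folio, lru_add_fn);
		local_unlock(&cpu_fbatches.lock);
	}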
On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > Hmm originally I was thinking of doing an (unconditional) > > lru_add_drain() outside of zswap_writeback_entry() - once in > > shrink_worker() and/or zswap_shrinker_scan(), before we write back any > > of the entries. Not sure if it would work/help here tho - haven't > > tested that idea yet. > > > > The pages are allocated by __read_swap_cache_async() in > zswap_writeback_entry() and it must be newly allocated, > not cached in swap. > Please see the code below in zswap_writeback_entry() > > page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, > NO_INTERLEAVE_INDEX, &page_was_allocated); > if (!page) { > goto fail;} > /* Found an existing page, we raced with load/swapin */ > if (!page_was_allocated) { > put_page(page); > ret = -EEXIST; > goto fail; > } > > So when it comes to SetPageReclaim(page), > The page has just been allocated and is still in the percpu batch, > which has not been added to the LRU. > > Therefore,lru_add_drain() did not work outside the > zswap_writeback_entry() You're right hmmm. > > > > > > > New test: > > > This patch will add the execution of folio_rotate_reclaimable(not executed > > > without this patch) and lru_add_drain,including percpu lock competition. > > > I bind a new task to allocate memory and use the same batch lock to compete > > > with the target process, on the same CPU. > > > context: > > > 1:stress --vm 1 --vm-bytes 1g (bind to cpu0) > > > 2:stress --vm 1 --vm-bytes 5g --vm-hang 0(bind to cpu0) > > > 3:reclaim pages, and writeback 5G zswap_entry in cpu0 and node 0. > > > > > > Average time of five tests > > > > > > Base patch patch + compete > > > 4.947 5.0676 5.1336 > > > +2.4% +3.7% > > > compete means: a new stress run in cpu0 to compete with the writeback process. > > > PID USER %CPU %MEM TIME+ COMMAND P > > > 1367 root 49.5 0.0 1:09.17 bash (writeback) 0 > > > 1737 root 49.5 2.2 0:27.46 stress (use percpu > > > lock) 0 > > > > > > around 2.4% increase in real time,including the execution of > > > folio_rotate_reclaimable(not executed without this patch) and lru_add_drain,but > > > no lock contentions. > > > > Hmm looks like the regression is still there, no? > > Yes, it cannot be eliminated. Yeah this solution is not convincing to me. The overall effect of this patch is we're trading runtime to save some swap usage. That seems backward to me? Are there any other observable benefits to this? Otherwise, unless the benchmark is an adversarial workload that is not representative of the common case (and you'll need to show me some alternative benchmark numbers or justifications to convince me here), IMHO this is a risky change to merge. This is not a feature that is gated behind a flag that users can safely ignore/disable. > > > > > > > > > around 1.3% additional increase in real time with lock contentions on the same > > > cpu. > > > > > > There is another option here, which is not to move the page to the > > > tail of the inactive > > > list after end_writeback and delete the following code in > > > zswap_writeback_entry(), > > > which did not work properly. But the pages will not be released first. > > > > > > /* move it to the tail of the inactive list after end_writeback */ > > > SetPageReclaim(page); > > > > Or only SetPageReclaim on pages on LRU? > > No, all the pages are newly allocted and not on LRU. > > This patch should add lru_add_drain() directly, remove the > if statement. 
> The purpose of writing back data is to release the page, > so I think it is necessary to fix it. > > Thanks for your time, Nhat. Is there a way to detect these kinds of folios at folio_rotate_reclaimable() callsite? i.e if this is a zswap-owned page etc. etc.?
On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > There is another option here, which is not to move the page to the > > > tail of the inactive > > > list after end_writeback and delete the following code in > > > zswap_writeback_entry(), > > > which did not work properly. But the pages will not be released first. > > > > > > /* move it to the tail of the inactive list after end_writeback */ > > > SetPageReclaim(page); Ok, so I took a look at the patch that originally introduced this piece of logic: https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97 Looks like it's not for the sake of correctness, but only as a best-effort optimization (reducing page scanning). If it doesn't bring any benefit (i.e due to the newly allocated page still on the cpu batch), then we can consider removing it. After all, if you're right and it's not really doing anything here - why bother. Perhaps we can replace this with some other mechanism to avoid it being scanned for reclaim. I would cc Weijie as well, as he is the original author of this.
On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > There is another option here, which is not to move the page to the > > > > tail of the inactive > > > > list after end_writeback and delete the following code in > > > > zswap_writeback_entry(), > > > > which did not work properly. But the pages will not be released first. > > > > > > > > /* move it to the tail of the inactive list after end_writeback */ > > > > SetPageReclaim(page); > > > Ok, so I took a look at the patch that originally introduced this > piece of logic: > > https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97 > > Looks like it's not for the sake of correctness, but only as a > best-effort optimization (reducing page scanning). If it doesn't bring > any benefit (i.e due to the newly allocated page still on the cpu > batch), then we can consider removing it. After all, if you're right > and it's not really doing anything here - why bother. Perhaps we can > replace this with some other mechanism to avoid it being scanned for > reclaim. For instance, we can grab the local lock, look for the folio in the add batch and take the folio off it, then add it to the rotate batch instead? Not sure if this is doable within folio_rotate_reclaimable(), or you'll have to manually perform this yourself (and remove the PG_reclaim flag set here so that folio_end_writeback() doesn't try to handle it). There is still some overhead with this, but at least we don't have to *drain everything* (which looks like what's lru_add_drain() -> lru_add_drain_cpu() is doing). The latter sounds expensive and unnecessary, whereas this is just one element addition and one element removal - and if IIUC the size of the per-cpu add batch is capped at 15, so lookup + removal (if possible) shouldn't be too expensive? Just throwing ideas out there :) > > I would cc Weijie as well, as he is the original author of this.
On Sun, Jan 7, 2024 at 1:59 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > > There is another option here, which is not to move the page to the > > > > > tail of the inactive > > > > > list after end_writeback and delete the following code in > > > > > zswap_writeback_entry(), > > > > > which did not work properly. But the pages will not be released first. > > > > > > > > > > /* move it to the tail of the inactive list after end_writeback */ > > > > > SetPageReclaim(page); > > > > > > Ok, so I took a look at the patch that originally introduced this > > piece of logic: > > > > https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97 > > > > Looks like it's not for the sake of correctness, but only as a > > best-effort optimization (reducing page scanning). If it doesn't bring > > any benefit (i.e due to the newly allocated page still on the cpu > > batch), then we can consider removing it. After all, if you're right > > and it's not really doing anything here - why bother. Perhaps we can > > replace this with some other mechanism to avoid it being scanned for > > reclaim. > > For instance, we can grab the local lock, look for the folio in the > add batch and take the folio off it, then add it to the rotate batch > instead? Not sure if this is doable within folio_rotate_reclaimable(), > or you'll have to manually perform this yourself (and remove the > PG_reclaim flag set here so that folio_end_writeback() doesn't try to > handle it). > > There is still some overhead with this, but at least we don't have to > *drain everything* (which looks like what's lru_add_drain() -> > lru_add_drain_cpu() is doing). The latter sounds expensive and > unnecessary, whereas this is just one element addition and one element > removal - and if IIUC the size of the per-cpu add batch is capped at > 15, so lookup + removal (if possible) shouldn't be too expensive? > > Just throwing ideas out there :) Sorry for being late to the party. It seems to me that all of this hassle can be avoided if lru_add_fn() did the right thing in this case and added the folio to the tail of the lru directly. I am no expert in how the page flags work here, but it seems like we can do something like this in lru_add_fn(): if (folio_test_reclaim(folio)) lruvec_add_folio_tail(lruvec, folio); else lruvec_add_folio(lruvec, folio); I think the main problem with this is that PG_reclaim is an alias to PG_readahead, so readahead pages will also go to the tail of the lru, which is probably not good. A more intrusive alternative is to introduce a folio_lru_add_tail() variant that always adds pages to the tail, and optionally call that from __read_swap_cache_async() instead of folio_lru_add() based on a new boolean argument. The zswap code can set that boolean argument during writeback to make sure newly allocated folios are always added to the tail of the lru.
On Mon, Jan 8, 2024 at 6:00 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > > There is another option here, which is not to move the page to the > > > > > tail of the inactive > > > > > list after end_writeback and delete the following code in > > > > > zswap_writeback_entry(), > > > > > which did not work properly. But the pages will not be released first. > > > > > > > > > > /* move it to the tail of the inactive list after end_writeback */ > > > > > SetPageReclaim(page); > > > > > > Ok, so I took a look at the patch that originally introduced this > > piece of logic: > > > > https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97 > > > > Looks like it's not for the sake of correctness, but only as a > > best-effort optimization (reducing page scanning). If it doesn't bring > > any benefit (i.e due to the newly allocated page still on the cpu > > batch), then we can consider removing it. After all, if you're right > > and it's not really doing anything here - why bother. Perhaps we can > > replace this with some other mechanism to avoid it being scanned for > > reclaim. > > For instance, we can grab the local lock, look for the folio in the > add batch and take the folio off it, then add it to the rotate batch > instead? Not sure if this is doable within folio_rotate_reclaimable(), > or you'll have to manually perform this yourself (and remove the > PG_reclaim flag set here so that folio_end_writeback() doesn't try to > handle it). > > There is still some overhead with this, but at least we don't have to > *drain everything* (which looks like what's lru_add_drain() -> > lru_add_drain_cpu() is doing). The latter sounds expensive and > unnecessary, whereas this is just one element addition and one element > removal - and if IIUC the size of the per-cpu add batch is capped at > 15, so lookup + removal (if possible) shouldn't be too expensive? > > Just throwing ideas out there :) Thanks for your time,Nhat. I will try other ways to solve this problem. > > > > > I would cc Weijie as well, as he is the original author of this.
Hi Yosry, glad to hear from you and happy new year! > Sorry for being late to the party. It seems to me that all of this > hassle can be avoided if lru_add_fn() did the right thing in this case > and added the folio to the tail of the lru directly. I am no expert in > how the page flags work here, but it seems like we can do something > like this in lru_add_fn(): > > if (folio_test_reclaim(folio)) > lruvec_add_folio_tail(lruvec, folio); > else > lruvec_add_folio(lruvec, folio); > > I think the main problem with this is that PG_reclaim is an alias to > PG_readahead, so readahead pages will also go to the tail of the lru, > which is probably not good. Agree with you, I will try it. > > A more intrusive alternative is to introduce a folio_lru_add_tail() > variant that always adds pages to the tail, and optionally call that > from __read_swap_cache_async() instead of folio_lru_add() based on a > new boolean argument. The zswap code can set that boolean argument > during writeback to make sure newly allocated folios are always added > to the tail of the lru. I have the same idea and also find it intrusive. I think the first solution is very good and I will try it. If it works, I will send the next version. Thank you very much.
On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> Hi Yosry, glad to hear from you and happy new year!
>
> > Sorry for being late to the party. It seems to me that all of this
> > hassle can be avoided if lru_add_fn() did the right thing in this case
> > and added the folio to the tail of the lru directly. I am no expert in
> > how the page flags work here, but it seems like we can do something
> > like this in lru_add_fn():
> >
> > if (folio_test_reclaim(folio))
> >         lruvec_add_folio_tail(lruvec, folio);
> > else
> >         lruvec_add_folio(lruvec, folio);
> >
> > I think the main problem with this is that PG_reclaim is an alias to
> > PG_readahead, so readahead pages will also go to the tail of the lru,
> > which is probably not good.
>
> Agree with you, I will try it.

+Matthew Wilcox

I think we need to figure out if it's okay to do this first, because
it will affect pages with PG_readahead as well.

>
> > A more intrusive alternative is to introduce a folio_lru_add_tail()
> > variant that always adds pages to the tail, and optionally call that
> > from __read_swap_cache_async() instead of folio_lru_add() based on a
> > new boolean argument. The zswap code can set that boolean argument
> > during writeback to make sure newly allocated folios are always added
> > to the tail of the lru.
>
> I have the same idea and also find it intrusive. I think the first solution
> is very good and I will try it. If it works, I will send the next version.

One way to avoid introducing folio_lru_add_tail() and plumbing a
boolean from zswap is to have a per-task context (similar to
memalloc_nofs_save()), that makes folio_add_lru() automatically add
folios to the tail of the LRU. I am not sure if this is an acceptable
approach though in terms of per-task flags and such.
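A purely hypothetical sketch of that per-task-context idea, modeled on memalloc_nofs_save()/memalloc_nofs_restore(). PF_LRU_ADD_TAIL, lru_add_tail_save() and lru_add_tail_restore() do not exist in the kernel; the names are invented here only to illustrate the shape of the approach:

	/*
	 * Hypothetical sketch only: PF_LRU_ADD_TAIL and these helpers are
	 * not existing kernel API, they just illustrate the idea.
	 */
	static inline unsigned int lru_add_tail_save(void)
	{
		unsigned int flags = current->flags & PF_LRU_ADD_TAIL;

		current->flags |= PF_LRU_ADD_TAIL;
		return flags;
	}

	static inline void lru_add_tail_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_LRU_ADD_TAIL) | flags;
	}

	/*
	 * folio_add_lru() (or the batch-drain path) would consult
	 * current->flags and use lruvec_add_folio_tail() instead of
	 * lruvec_add_folio() while the flag is set; zswap writeback would
	 * wrap its __read_swap_cache_async() call in save/restore.
	 */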
On Tue, Jan 9, 2024 at 8:30 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > Hi Yosry, glad to hear from you and happy new year! > > > > > Sorry for being late to the party. It seems to me that all of this > > > hassle can be avoided if lru_add_fn() did the right thing in this case > > > and added the folio to the tail of the lru directly. I am no expert in > > > how the page flags work here, but it seems like we can do something > > > like this in lru_add_fn(): > > > > > > if (folio_test_reclaim(folio)) > > > lruvec_add_folio_tail(lruvec, folio); > > > else > > > lruvec_add_folio(lruvec, folio); > > > > > > I think the main problem with this is that PG_reclaim is an alias to > > > PG_readahead, so readahead pages will also go to the tail of the lru, > > > which is probably not good. This sounds dangerous. This is going to introduce a rather large unexpected side effect - we're changing the readahead behavior in a seemingly small zswap optimization. In fact, I'd argue that if we do this, the readahead behavior change will be the "main effect", and the zswap-side change would be a "happy consequence". We should run a lot of benchmarking and document the change extensively if we pursue this route. > > > > Agree with you, I will try it. > > +Matthew Wilcox > > I think we need to figure out if it's okay to do this first, because > it will affect pages with PG_readahead as well. > > > > > > > > > A more intrusive alternative is to introduce a folio_lru_add_tail() > > > variant that always adds pages to the tail, and optionally call that > > > from __read_swap_cache_async() instead of folio_lru_add() based on a > > > new boolean argument. The zswap code can set that boolean argument > > > during writeback to make sure newly allocated folios are always added > > > to the tail of the lru. Unless some page flag/readahead expert can confirm that the first option is safe, my vote is on this option. I mean, it's fairly minimal codewise, no? Just a bunch of plumbing. We can also keep the other call sites intact if we just rename the old versions - something along the line of: __read_swap_cache_async_head(..., bool add_to_lru_head) { ... if (add_to_lru_head) folio_add_lru(folio) else folio_add_lru_tail(folio); } __read_swap_cache_async(...) { return __read_swap_cache_async_tail(..., true); } A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* much more work. > > > > I have the same idea and also find it intrusive. I think the first solution > > is very good and I will try it. If it works, I will send the next version. > > One way to avoid introducing folio_lru_add_tail() and blumping a > boolean from zswap is to have a per-task context (similar to > memalloc_nofs_save()), that makes folio_add_lru() automatically add > folios to the tail of the LRU. I am not sure if this is an acceptable > approach though in terms of per-task flags and such. This seems a bit hacky and obscure, but maybe it could work.
On Wed, Jan 10, 2024 at 12:30 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> >
> > Hi Yosry, glad to hear from you and happy new year!
> >
> > > Sorry for being late to the party. It seems to me that all of this
> > > hassle can be avoided if lru_add_fn() did the right thing in this case
> > > and added the folio to the tail of the lru directly. I am no expert in
> > > how the page flags work here, but it seems like we can do something
> > > like this in lru_add_fn():
> > >
> > > if (folio_test_reclaim(folio))
> > >         lruvec_add_folio_tail(lruvec, folio);
> > > else
> > >         lruvec_add_folio(lruvec, folio);
> > >
> > > I think the main problem with this is that PG_reclaim is an alias to
> > > PG_readahead, so readahead pages will also go to the tail of the lru,
> > > which is probably not good.
> >
> > Agree with you, I will try it.
>
> +Matthew Wilcox
>
> I think we need to figure out if it's okay to do this first, because
> it will affect pages with PG_readahead as well.

Yes, I've tested it and there is one more thing that needs to be modified:

+	if (folio_test_reclaim(folio))
+		lruvec_add_folio_tail(lruvec, folio);
+	else
+		lruvec_add_folio(lruvec, folio);

@@ -1583,10 +1583,8 @@ void folio_end_writeback(struct folio *folio)
	 * a gain to justify taking an atomic operation penalty at the
	 * end of every folio writeback.
	 */
-	if (folio_test_reclaim(folio)) {
+	if (folio_test_reclaim(folio) && folio_rotate_reclaimable(folio))
		folio_clear_reclaim(folio);
-		folio_rotate_reclaimable(folio);
-	}

-void folio_rotate_reclaimable(struct folio *folio)
+bool folio_rotate_reclaimable(struct folio *folio)
 {
	if (success)
+		return true;
	}
+	return false;
 }

> >
> > > A more intrusive alternative is to introduce a folio_lru_add_tail()
> > > variant that always adds pages to the tail, and optionally call that
> > > from __read_swap_cache_async() instead of folio_lru_add() based on a
> > > new boolean argument. The zswap code can set that boolean argument
> > > during writeback to make sure newly allocated folios are always added
> > > to the tail of the lru.
> >
> > I have the same idea and also find it intrusive. I think the first solution
> > is very good and I will try it. If it works, I will send the next version.
>
> One way to avoid introducing folio_lru_add_tail() and plumbing a
> boolean from zswap is to have a per-task context (similar to
> memalloc_nofs_save()), that makes folio_add_lru() automatically add
> folios to the tail of the LRU. I am not sure if this is an acceptable
> approach though in terms of per-task flags and such.

I got it.
>
> This sounds dangerous. This is going to introduce a rather large
> unexpected side effect - we're changing the readahead behavior in a
> seemingly small zswap optimization. In fact, I'd argue that if we do
> this, the readahead behavior change will be the "main effect", and the
> zswap-side change would be a "happy consequence". We should run a lot
> of benchmarking and document the change extensively if we pursue this
> route.
>

I agree about the unexpected side effect, and here I need to clarify the
original intention of this patch. Please see the memory offloading steps
below.

	memory      zswap (reclaim)      memory + swap (writeback)
	1G          0.5G                 1G (tmp memory) + 1G (swap)

If the decompressed memory cannot be released in time, zswap's writeback
has significant side effects (the pages are mostly cold). On the one hand,
the memory footprint has not been reduced but has increased (from
0.5G -> 1G). At the same time, the pages are not put at the tail of the
lru, so when memory is insufficient, other pages will be squeezed out and
released earlier instead.

With this patch, we can put the temporary pages at the tail and reclaim
them in time when memory is insufficient or actively reclaimed. So I think
this patch makes sense and hope the problem can be fixed with a suitable
approach.

> Unless some page flag/readahead expert can confirm that the first
> option is safe, my vote is on this option. I mean, it's fairly minimal
> codewise, no? Just a bunch of plumbing. We can also keep the other
> call sites intact if we just rename the old versions - something along
> the line of:
>
> __read_swap_cache_async_head(..., bool add_to_lru_head)
> {
> ...
> if (add_to_lru_head)
>         folio_add_lru(folio)
> else
>         folio_add_lru_tail(folio);
> }
>
> __read_swap_cache_async(...)
> {
> return __read_swap_cache_async_tail(..., true);
> }
>
> A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> much more work.
>

Yes, agree. I will try it again.

> >
> > > I have the same idea and also find it intrusive. I think the first solution
> > > is very good and I will try it. If it works, I will send the next version.
> >
> > One way to avoid introducing folio_lru_add_tail() and plumbing a
> > boolean from zswap is to have a per-task context (similar to
> > memalloc_nofs_save()), that makes folio_add_lru() automatically add
> > folios to the tail of the LRU. I am not sure if this is an acceptable
> > approach though in terms of per-task flags and such.
>
> This seems a bit hacky and obscure, but maybe it could work.
On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > This sounds dangerous. This is going to introduce a rather large > > unexpected side effect - we're changing the readahead behavior in a > > seemingly small zswap optimization. In fact, I'd argue that if we do > > this, the readahead behavior change will be the "main effect", and the > > zswap-side change would be a "happy consequence". We should run a lot > > of benchmarking and document the change extensively if we pursue this > > route. > > > > I agree with the unexpected side effect, and here I need > to clarify the original intention of this patch.Please see the memory > offloading steps below. > > > memory zswap(reclaim) memory+swap (writeback) > 1G 0.5G 1G(tmp memory) + 1G(swap) > > If the decompressed memory cannot be released in time, > zswap's writeback has great side effects(mostly clod pages). > On the one hand, the memory space has not been reduced, > but has increased (from 0.5G->1G). > At the same time, it is not put the pages to the tail of the lru. > When the memory is insufficient, other pages will be squeezed out > and released early. > With this patch, we can put the tmp pages to the tail and reclaim it > in time when the memory is insufficient or actively reclaimed. > So I think this patch makes sense and hope it can be fixed with a > suitable approaches. > > > > > Unless some page flag/readahead expert can confirm that the first > > option is safe, my vote is on this option. I mean, it's fairly minimal > > codewise, no? Just a bunch of plumbing. We can also keep the other > > call sites intact if we just rename the old versions - something along > > the line of: > > > > __read_swap_cache_async_head(..., bool add_to_lru_head) > > { > > ... > > if (add_to_lru_head) > > folio_add_lru(folio) > > else > > folio_add_lru_tail(folio); > > } > > > > __read_swap_cache_async(...) > > { > > return __read_swap_cache_async_tail(..., true); > > } > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* > > much more work. > > > > Yes, agree. I will try it again. I agree with Nhat here. Unless someone with enough readahead/page flags knowledge says putting PG_readahead pages at the tail of the LRU is okay (doubtful), I think we should opt for introducing a folio_add_lru_tail() as I initially suggested.
On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > This sounds dangerous. This is going to introduce a rather large > > unexpected side effect - we're changing the readahead behavior in a > > seemingly small zswap optimization. In fact, I'd argue that if we do > > this, the readahead behavior change will be the "main effect", and the > > zswap-side change would be a "happy consequence". We should run a lot > > of benchmarking and document the change extensively if we pursue this > > route. > > > > I agree with the unexpected side effect, and here I need > to clarify the original intention of this patch.Please see the memory > offloading steps below. > > > memory zswap(reclaim) memory+swap (writeback) > 1G 0.5G 1G(tmp memory) + 1G(swap) > > If the decompressed memory cannot be released in time, > zswap's writeback has great side effects(mostly clod pages). > On the one hand, the memory space has not been reduced, > but has increased (from 0.5G->1G). > At the same time, it is not put the pages to the tail of the lru. > When the memory is insufficient, other pages will be squeezed out > and released early. > With this patch, we can put the tmp pages to the tail and reclaim it > in time when the memory is insufficient or actively reclaimed. > So I think this patch makes sense and hope it can be fixed with a > suitable approaches. Makes sense to me. IIUC, that's the original intention behind calling SetPageReclaim() - unfortunately that doesn't work :) And IIRC, your original attempt shows reduction in swap usage (albeit at the cost of performance regression), which means we're onto something. I believe that the folio_lru_add_tail() approach will work :) Please include a version of the clarification paragraph above in your later version to explain the goal of the optimization, along with suitable benchmark numbers to show the effect (such as minimal change in performance, and reduction in some metrics). Maybe include the link to the original patch that introduces SetPageReclaim() too, to show the motivation behind all of this :) It'd be nice to have all the contexts readily available, in case we need to revisit this in the future (as was the case with the SetPageReclaim() here). > > > > > Unless some page flag/readahead expert can confirm that the first > > option is safe, my vote is on this option. I mean, it's fairly minimal > > codewise, no? Just a bunch of plumbing. We can also keep the other > > call sites intact if we just rename the old versions - something along > > the line of: > > > > __read_swap_cache_async_head(..., bool add_to_lru_head) > > { > > ... > > if (add_to_lru_head) > > folio_add_lru(folio) > > else > > folio_add_lru_tail(folio); > > } > > > > __read_swap_cache_async(...) > > { > > return __read_swap_cache_async_tail(..., true); > > } > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* > > much more work. > > > > Yes, agree. I will try it again. Look forward to seeing it! Thanks for your patience and for working on this.
On Fri, Jan 12, 2024 at 3:25 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > > This sounds dangerous. This is going to introduce a rather large > > > unexpected side effect - we're changing the readahead behavior in a > > > seemingly small zswap optimization. In fact, I'd argue that if we do > > > this, the readahead behavior change will be the "main effect", and the > > > zswap-side change would be a "happy consequence". We should run a lot > > > of benchmarking and document the change extensively if we pursue this > > > route. > > > > > > > I agree with the unexpected side effect, and here I need > > to clarify the original intention of this patch.Please see the memory > > offloading steps below. > > > > > > memory zswap(reclaim) memory+swap (writeback) > > 1G 0.5G 1G(tmp memory) + 1G(swap) > > > > If the decompressed memory cannot be released in time, > > zswap's writeback has great side effects(mostly clod pages). > > On the one hand, the memory space has not been reduced, > > but has increased (from 0.5G->1G). > > At the same time, it is not put the pages to the tail of the lru. > > When the memory is insufficient, other pages will be squeezed out > > and released early. > > With this patch, we can put the tmp pages to the tail and reclaim it > > in time when the memory is insufficient or actively reclaimed. > > So I think this patch makes sense and hope it can be fixed with a > > suitable approaches. > > Makes sense to me. IIUC, that's the original intention behind calling > SetPageReclaim() - unfortunately that doesn't work :) And IIRC, your > original attempt shows reduction in swap usage (albeit at the cost of > performance regression), which means we're onto something. I believe > that the folio_lru_add_tail() approach will work :) > > Please include a version of the clarification paragraph above in your > later version to explain the goal of the optimization, along with > suitable benchmark numbers to show the effect (such as minimal change > in performance, and reduction in some metrics). Maybe include the link > to the original patch that introduces SetPageReclaim() too, to show > the motivation behind all of this :) It'd be nice to have all the > contexts readily available, in case we need to revisit this in the > future (as was the case with the SetPageReclaim() here). > OK. > > > > > > > > Unless some page flag/readahead expert can confirm that the first > > > option is safe, my vote is on this option. I mean, it's fairly minimal > > > codewise, no? Just a bunch of plumbing. We can also keep the other > > > call sites intact if we just rename the old versions - something along > > > the line of: > > > > > > __read_swap_cache_async_head(..., bool add_to_lru_head) > > > { > > > ... > > > if (add_to_lru_head) > > > folio_add_lru(folio) > > > else > > > folio_add_lru_tail(folio); > > > } > > > > > > __read_swap_cache_async(...) > > > { > > > return __read_swap_cache_async_tail(..., true); > > > } > > > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* > > > much more work. > > > > > > > Yes, agree. I will try it again. > > Look forward to seeing it! Thanks for your patience and for working on this. Thanks for your time.
> > > > > > > > Unless some page flag/readahead expert can confirm that the first > > > > option is safe, my vote is on this option. I mean, it's fairly minimal > > > > codewise, no? Just a bunch of plumbing. We can also keep the other > > > > call sites intact if we just rename the old versions - something along > > > > the line of: > > > > > > > > __read_swap_cache_async_head(..., bool add_to_lru_head) > > > > { > > > > ... > > > > if (add_to_lru_head) > > > > folio_add_lru(folio) > > > > else > > > > folio_add_lru_tail(folio); > > > > } > > > > > > > > __read_swap_cache_async(...) > > > > { > > > > return __read_swap_cache_async_tail(..., true); > > > > } > > > > > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* > > > > much more work. > > > > > > > > > > Yes, agree. I will try it again. > > > > Look forward to seeing it! Thanks for your patience and for working on this. Please forgive me for adding additional information about this patch. I have finished the attempt at introducing a folio_add_lru_tail(), but there are a couple of problems: 1) A new page can be moved onto the LRU only by lru_add_fn, so folio_add_lru_tail() cannot add pages to the LRU because of the following check in folio_batch_move_lru(), which was added by Alex Shi to serialize memcg changes in pagevec_lru_move_fn [1]. /* block memcg migration while the folio moves between lru */ if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) continue; To achieve the goal, we would need to add a new move_fn like lru_add_fn (one that handles folios which do not yet have the LRU flag) for folio_add_lru_tail(), and exempt it here as well: + if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new && + !folio_test_clear_lru(folio)) 2) __read_swap_cache_async() already has six parameters, so there is no room to add a new one, add_to_lru_head. So, for the reasons above, it seems a bit hacky just for a special case. Back to the beginning, lru_add_drain() is the simplest option, and it is already common below __read_swap_cache_async(): please see swap_cluster_readahead() and swap_vma_readahead(), where it is of course batched. Or we could leave this problem alone until zswap writeback is batched. Thanks again.
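A minimal sketch of what point 1) would look like, for concreteness. Everything here is hypothetical naming: lru_add_tail_fn() stands for the lru_move_tail_fn_new mentioned above, cpu_fbatches.lru_add_tail would be a new dedicated per-cpu batch (which lru_add_drain_cpu() would also have to drain), and the shapes are modeled on lru_add_fn()/folio_add_lru() in mm/swap.c.

	/* Like lru_add_fn(), but queue the new (!PageLRU) folio at the LRU tail. */
	static void lru_add_tail_fn(struct lruvec *lruvec, struct folio *folio)
	{
		VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
		lruvec_add_folio_tail(lruvec, folio);
	}

	/* Like folio_add_lru(), but via the dedicated tail batch. */
	void folio_add_lru_tail(struct folio *folio)
	{
		struct folio_batch *fbatch;

		VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

		folio_get(folio);
		local_lock(&cpu_fbatches.lock);
		fbatch = this_cpu_ptr(&cpu_fbatches.lru_add_tail);	/* new field */
		folio_batch_add_and_move(fbatch, folio, lru_add_tail_fn);
		local_unlock(&cpu_fbatches.lock);
	}

	/* And the check in folio_batch_move_lru() gains a second exemption: */
		if (move_fn != lru_add_fn && move_fn != lru_add_tail_fn &&
		    !folio_test_clear_lru(folio))
			continue;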
On Tue, Jan 16, 2024 at 5:40 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > > > > > Unless some page flag/readahead expert can confirm that the first > > > > > option is safe, my vote is on this option. I mean, it's fairly minimal > > > > > codewise, no? Just a bunch of plumbing. We can also keep the other > > > > > call sites intact if we just rename the old versions - something along > > > > > the line of: > > > > > > > > > > __read_swap_cache_async_head(..., bool add_to_lru_head) > > > > > { > > > > > ... > > > > > if (add_to_lru_head) > > > > > folio_add_lru(folio) > > > > > else > > > > > folio_add_lru_tail(folio); > > > > > } > > > > > > > > > > __read_swap_cache_async(...) > > > > > { > > > > > return __read_swap_cache_async_tail(..., true); > > > > > } > > > > > > > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that* > > > > > much more work. > > > > > > > > > > > > > Yes, agree. I will try it again. > > > > > > Look forward to seeing it! Thanks for your patience and for working on this. > > Please forgive me for adding additional information about this patch. > > I have finished the opt for introducing a folio_add_lru_tail(), but > there are many > questions: > 1) A new page can be move to LRU only by lru_add_fn, so > folio_add_lru_tail could not add pages to LRU for the following code > in folio_batch_move_lru(),which is added by Alex Shi for > serializing memcg changes in pagevec_lru_move_fn[1]. > > /* block memcg migration while the folio moves between lru */ > if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) > continue; > To achieve the goal, we need to add a new function like lru_add_fn > which does not have the lru flag and folio_add_lru_tail() > + if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new && > + !folio_test_clear_lru(folio)) > > 2) __read_swap_cache_async has six parameters, so there is no space to > add a new one, add_to_lru_head. > > So it seems a bit hacky just for a special case for the reasons above. It's a lot of plumbing for sure. Adding a flag to current task_struct is a less-noisy yet-still-hacky solution. I am not saying we should do it, but it's an option. I am not sure how much task flags we have to spare. > > Back to the beginning, lru_add_drain() is the simplest option,which is common > below the __read_swap_cache_async(). Please see the function > swap_cluster_readahead() > and swap_vma_readahead(), of course it has been batched. > > Or we should leave this problem alone,before we can write back zswap > in batches. Calling lru_add_drain() for every written back page is an overkill imo. If we have writeback batching at some point, it may make more sense then. Adding Michal Hocko was recently complaining [1] about lru_add_drain() being called unnecessarily elsewhere. [1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
On Tue, Jan 16, 2024 at 09:40:05PM +0800, Zhongkun He wrote: > 2) __read_swap_cache_async has six parameters, so there is no space to > add a new one, add_to_lru_head. That's easy enough. Define a new set of flags, make one of them the equivalent of skip_if_exists. Something like: typedef unsigned int __bitwise read_swap_t; #define READ_SWAP_SKIP_EXISTING ((__force read_swap_t)0x00000001) #define READ_SWAP_ADD_TAIL ((__force read_swap_t)0x00000002) There's only six callers of __read_swap_cache_async() to convert, so not really a big deal.
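Formatted for readability, Matthew's suggestion amounts to something like the following. The READ_SWAP_* names come from his mail; the converted prototype and the zswap call site are illustrative only (mpol and ilx stand for whatever the caller already passes today).

	typedef unsigned int __bitwise read_swap_t;

	#define READ_SWAP_SKIP_EXISTING	((__force read_swap_t)0x00000001)
	#define READ_SWAP_ADD_TAIL	((__force read_swap_t)0x00000002)

	/* The bool skip_if_exists parameter becomes a flags word... */
	struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
			struct mempolicy *mpol, pgoff_t ilx,
			bool *new_page_allocated, read_swap_t flags);

	/* ...and zswap_writeback_entry() could then request tail placement: */
	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, ilx,
				       &page_was_allocated, READ_SWAP_ADD_TAIL);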
> > > > Please forgive me for adding additional information about this patch. > > > > I have finished the opt for introducing a folio_add_lru_tail(), but > > there are many > > questions: > > 1) A new page can be move to LRU only by lru_add_fn, so > > folio_add_lru_tail could not add pages to LRU for the following code > > in folio_batch_move_lru(),which is added by Alex Shi for > > serializing memcg changes in pagevec_lru_move_fn[1]. > > > > /* block memcg migration while the folio moves between lru */ > > if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) > > continue; > > To achieve the goal, we need to add a new function like lru_add_fn > > which does not have the lru flag and folio_add_lru_tail() > > + if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new && > > + !folio_test_clear_lru(folio)) > > > > 2) __read_swap_cache_async has six parameters, so there is no space to > > add a new one, add_to_lru_head. > > > > So it seems a bit hacky just for a special case for the reasons above. > > It's a lot of plumbing for sure. Adding a flag to current task_struct > is a less-noisy yet-still-hacky solution. I am not saying we should do > it, but it's an option. I am not sure how much task flags we have to > spare. Got it. > > > > > Back to the beginning, lru_add_drain() is the simplest option,which is common > > below the __read_swap_cache_async(). Please see the function > > swap_cluster_readahead() > > and swap_vma_readahead(), of course it has been batched. > > > > Or we should leave this problem alone,before we can write back zswap > > in batches. > > Calling lru_add_drain() for every written back page is an overkill > imo. If we have writeback batching at some point, it may make more > sense then. Agree. > > Adding Michal Hocko was recently complaining [1] about lru_add_drain() > being called unnecessarily elsewhere. Got it, thanks. > > [1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
On Wed, Jan 17, 2024 at 5:04 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jan 16, 2024 at 09:40:05PM +0800, Zhongkun He wrote: > > 2) __read_swap_cache_async has six parameters, so there is no space to > > add a new one, add_to_lru_head. > > That's easy enough. Define a new set of flags, make one of them the > equivalent of skip_if_exists. Something like: > > typedef unsigned int __bitwise read_swap_t; > > #define READ_SWAP_SKIP_EXISTING ((__force read_swap_t)0x00000001) > #define READ_SWAP_ADD_TAIL ((__force read_swap_t)0x00000002) > > There's only six callers of __read_swap_cache_async() to convert, so not > really a big deal. > Yes, thanks for your suggestion. The major problem is not the parameters, but the need to add three functions to deal with a special case. Thanks again.
On Wed, Jan 17, 2024 at 1:53 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > Please forgive me for adding additional information about this patch. > > > > > > I have finished the opt for introducing a folio_add_lru_tail(), but > > > there are many > > > questions: > > > 1) A new page can be move to LRU only by lru_add_fn, so > > > folio_add_lru_tail could not add pages to LRU for the following code > > > in folio_batch_move_lru(),which is added by Alex Shi for > > > serializing memcg changes in pagevec_lru_move_fn[1]. > > > > > > /* block memcg migration while the folio moves between lru */ > > > if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) > > > continue; > > > To achieve the goal, we need to add a new function like lru_add_fn > > > which does not have the lru flag and folio_add_lru_tail() > > > + if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new && > > > + !folio_test_clear_lru(folio)) > > > > > > 2) __read_swap_cache_async has six parameters, so there is no space to > > > add a new one, add_to_lru_head. > > > > > > So it seems a bit hacky just for a special case for the reasons above. > > > > It's a lot of plumbing for sure. Adding a flag to current task_struct > > is a less-noisy yet-still-hacky solution. I am not saying we should do > > it, but it's an option. I am not sure how much task flags we have to > > spare. > > Got it. Actually this won't really work. Writeback can be asynchronous, so there would be no logical place to unset the flag.
On Wed, Jan 17, 2024 at 1:52 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > > > > > > Please forgive me for adding additional information about this patch. > > > > > > I have finished the opt for introducing a folio_add_lru_tail(), but > > > there are many > > > questions: > > > 1) A new page can be move to LRU only by lru_add_fn, so > > > folio_add_lru_tail could not add pages to LRU for the following code > > > in folio_batch_move_lru(),which is added by Alex Shi for > > > serializing memcg changes in pagevec_lru_move_fn[1]. > > > > > > /* block memcg migration while the folio moves between lru */ > > > if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) > > > continue; > > > To achieve the goal, we need to add a new function like lru_add_fn > > > which does not have the lru flag and folio_add_lru_tail() > > > + if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new && > > > + !folio_test_clear_lru(folio)) Hmm yeah, I guess it is a bit more plumbing to do. I prefer this though - not very fond of hacking current's flag just for a small optimization :) And I'd argue this is the "right" thing to do - draining the other LRU operation batches just so that we can successfully perform an add-to-tail seems hacky and wrong to me. > > > > > > 2) __read_swap_cache_async has six parameters, so there is no space to > > > add a new one, add_to_lru_head. Matthew's solution seems fine to me, no? i.e using a single flag parameter to encapsulate all boolean arguments. > > > > > > So it seems a bit hacky just for a special case for the reasons above. > > > > It's a lot of plumbing for sure. Adding a flag to current task_struct > > is a less-noisy yet-still-hacky solution. I am not saying we should do > > it, but it's an option. I am not sure how much task flags we have to > > spare. > > Got it. > > > > > > > > Back to the beginning, lru_add_drain() is the simplest option,which is common > > > below the __read_swap_cache_async(). Please see the function > > > swap_cluster_readahead() > > > and swap_vma_readahead(), of course it has been batched. > > > > > > Or we should leave this problem alone,before we can write back zswap > > > in batches. > > > > Calling lru_add_drain() for every written back page is an overkill > > imo. If we have writeback batching at some point, it may make more > > sense then. > > Agree. Agree. lru_add_drain() does quite a bit, and doing it for every written page makes batching less effective. And as argued above, I don't think we should do this. I'm fine with waiting til writeback batching too :) But that will be a bigger task. > > > > > Adding Michal Hocko was recently complaining [1] about lru_add_drain() > > being called unnecessarily elsewhere. > > Got it, thanks. > > > > [1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
diff --git a/mm/zswap.c b/mm/zswap.c
index 083c693602b8..b9b94cbd403c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1139,6 +1139,11 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* move it to the tail of the inactive list after end_writeback */
 	SetPageReclaim(page);
 
+	if (!PageLRU(page)) {
+		/* drain lru cache to help folio_rotate_reclaimable() */
+		lru_add_drain();
+	}
+
 	/* start writeback */
 	__swap_writepage(page, &wbc);
 	put_page(page);
zswap_writeback_entry() adds a page to the swap cache, decompresses the entry data into the page, and issues a bio write to write the page back to the swap device. It then tries to move the page to the tail of the inactive lru list via SetPageReclaim(page) and folio_rotate_reclaimable().

Currently, about half of these pages fail to reach the tail of the lru list because the LRU flag is not yet set on them: they are not on the LRU list but still sit in the per-cpu cpu_fbatches. Fix this by draining the lru cache before starting writeback.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 mm/zswap.c | 5 +++++
 1 file changed, 5 insertions(+)
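For context on why roughly half of the rotations fail without a drain: folio_rotate_reclaimable() only queues folios that already carry the LRU flag, and a page freshly added to the swap cache by zswap writeback may still be sitting in a per-cpu lru_add batch when writeback completes. A simplified paraphrase of the mm/swap.c logic:

	void folio_rotate_reclaimable(struct folio *folio)
	{
		/*
		 * The folio_test_lru() check below is what defeats
		 * SetPageReclaim() for pages still parked in a per-cpu
		 * lru_add batch: they are silently left where they are.
		 */
		if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
		    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
			struct folio_batch *fbatch;
			unsigned long flags;

			folio_get(folio);
			local_lock_irqsave(&lru_rotate.lock, flags);
			fbatch = this_cpu_ptr(&lru_rotate.fbatch);
			folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
			local_unlock_irqrestore(&lru_rotate.lock, flags);
		}
	}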