Message ID | 20240408121439.GA252652@bytedance (mailing list archive) |
---|---|
State | New |
Series | [v2] mm: swap: prejudgement swap_has_cache to avoid page allocation |
On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:

> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
> Create 1G anon mmap and set it to shared, and has two processes
> randomly access the shared memory. When they are racing on swap cache,
> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
> took about 1475 us.

And what effect does this patch have upon the measured time? And upon
overall runtime?

> So skip page allocation if SWAP_HAS_CACHE was set, just
> schedule_timeout_uninterruptible and continue to acquire page
> via filemap_get_folio() from swap cache, to speedup
> __read_swap_cache_async.
Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>
>> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
>> Create 1G anon mmap and set it to shared, and has two processes
>> randomly access the shared memory. When they are racing on swap cache,
>> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
>> took about 1475 us.
>
> And what effect does this patch have upon the measured time? And upon
> overall runtime?

And the patch will cause increased lock contention, please test with
more processes and perhaps HDD swap device too.

>> So skip page allocation if SWAP_HAS_CACHE was set, just
>> schedule_timeout_uninterruptible and continue to acquire page
>> via filemap_get_folio() from swap cache, to speedup
>> __read_swap_cache_async.

--
Best Regards,
Huang, Ying
On Mon, Apr 08, 2024 at 01:27:04PM -0700, Andrew Morton wrote:
> On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>
> > Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
> > Create 1G anon mmap and set it to shared, and has two processes
> > randomly access the shared memory. When they are racing on swap cache,
> > on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
> > took about 1475 us.
>
> And what effect does this patch have upon the measured time? And upon
> overall runtime?

Hi Andrew,

When share memory between two or more processes has swapped and pagefault now,
it would readahead swap and call __read_swap_cache_async().
If one of the processes calls swapcache_prepare() and finds that the cache
has been EXIST(another process added), it will folio_put on the basis of the
alloc_pages_mpol() that has been called, and then try filemap_get_folio() again.

I think the page alloc in this process is wasteful.
when the memory pressure is large, alloc_pages_mpol() will be time-consuming,
so the purpose of my patch is to judge whether the page has cache before page alloc,
then skip page alloc and retry filemap_get_folio() to save the time of the function.

Thank you.

>
> > So skip page allocation if SWAP_HAS_CACHE was set, just
> > schedule_timeout_uninterruptible and continue to acquire page
> > via filemap_get_folio() from swap cache, to speedup
> > __read_swap_cache_async.
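The ordering argument in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct swap_slot_model`, `swapin_alloc_first()` and `swapin_check_first()` are hypothetical names, the counters stand in for `alloc_pages_mpol()` and `folio_put()`, and the model ignores locking and concurrency entirely. Only the `SWAP_HAS_CACHE` flag value (0x40) is taken from `include/linux/swap.h`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SWAP_HAS_CACHE 0x40 /* flag bit in a swap_map[] byte, as in include/linux/swap.h */

/* Hypothetical toy state: one swap_map byte plus counters for the model. */
struct swap_slot_model {
	unsigned char swap_map; /* count bits | SWAP_HAS_CACHE */
	unsigned int allocs;    /* stand-in for alloc_pages_mpol() calls */
	unsigned int puts;      /* stand-in for folio_put() calls */
};

/* Existing order: allocate first, then try to claim the slot. */
static int swapin_alloc_first(struct swap_slot_model *s)
{
	assert(s != NULL);
	s->allocs++;
	if (s->swap_map & SWAP_HAS_CACHE) {
		/* swapcache_prepare() would fail with -EEXIST here, */
		/* so the folio that was just allocated is dropped again. */
		s->puts++;
		return -EEXIST;
	}
	s->swap_map |= SWAP_HAS_CACHE;
	return 0;
}

/* Proposed order: peek at the flag and skip the allocation if it is set. */
static int swapin_check_first(struct swap_slot_model *s)
{
	assert(s != NULL);
	if (s->swap_map & SWAP_HAS_CACHE)
		return -EEXIST; /* corresponds to "goto skip_alloc" in the patch */
	return swapin_alloc_first(s);
}
```

In this model the check-first variant avoids the allocate-then-free pair only when the flag is already set at the time of the check. Whether that case is frequent enough to matter in a real kernel is exactly what the reviewers ask to be measured.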
On Tue, Apr 09, 2024 at 09:07:29AM +0800, Huang, Ying wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
> >
> >> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
> >> Create 1G anon mmap and set it to shared, and has two processes
> >> randomly access the shared memory. When they are racing on swap cache,
> >> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
> >> took about 1475 us.
> >
> > And what effect does this patch have upon the measured time? And upon
> > overall runtime?
>
> And the patch will cause increased lock contention, please test with
> more processes and perhaps HDD swap device too.

Hi Ying,

Thank you for your suggestion.
It may indeed cause some lock contention, as mentioned by Kairui before.

If so, is it recommended?
---
unsigned char swap_map, mapcount, hascache;
...
/* Return raw data of the si->swap_map[offset] */
swap_map = __swap_map(si, entry);
mapcount = swap_map & ~SWAP_HAS_CACHE;
if (!mapcount && swap_slot_cache_enabled)
...
hascache = swap_map & SWAP_HAS_CACHE;
/* Could judge that it's being added to swap cache with high probability */
if (mapcount && hascache)
	goto skip_alloc;
...
---
In doing so, there is no additional use of locks.

>
> >> So skip page allocation if SWAP_HAS_CACHE was set, just
> >> schedule_timeout_uninterruptible and continue to acquire page
> >> via filemap_get_folio() from swap cache, to speedup
> >> __read_swap_cache_async.
>
> --
> Best Regards,
> Huang, Ying
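The bit manipulation in the pseudocode above can be spelled out as a standalone sketch. The names `struct swap_map_view`, `decode_swap_map()` and `should_skip_alloc()` are hypothetical and exist only for illustration; the sketch assumes, as the pseudocode does, that everything in the byte other than `SWAP_HAS_CACHE` is treated as the map count, and it says nothing about how the raw byte is obtained from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_HAS_CACHE 0x40 /* flag bit in a swap_map[] byte, as in include/linux/swap.h */

/* Hypothetical decoded view of one raw swap_map byte. */
struct swap_map_view {
	unsigned char mapcount; /* all bits except SWAP_HAS_CACHE */
	bool has_cache;         /* true if SWAP_HAS_CACHE is set */
};

/* Split a raw swap_map byte the same way the pseudocode does. */
static struct swap_map_view decode_swap_map(unsigned char raw)
{
	struct swap_map_view v;

	v.mapcount = (unsigned char)(raw & ~SWAP_HAS_CACHE);
	v.has_cache = (raw & SWAP_HAS_CACHE) != 0;
	/* Sanity check: the count part never carries the flag bit. */
	assert((v.mapcount & SWAP_HAS_CACHE) == 0);
	return v;
}

/* Mirror of "if (mapcount && hascache) goto skip_alloc;". */
static bool should_skip_alloc(unsigned char raw)
{
	struct swap_map_view v = decode_swap_map(raw);

	return v.mapcount != 0 && v.has_cache;
}
```

Because both values come from a single copy of the byte, the count and the flag are always mutually consistent, although the byte itself may already be stale by the time the decision is acted on.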
On Tue, Apr 9, 2024 at 7:57 AM Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>
> On Tue, Apr 09, 2024 at 09:07:29AM +0800, Huang, Ying wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> >
> > > On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
> > >
> > >> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
> > >> Create 1G anon mmap and set it to shared, and has two processes
> > >> randomly access the shared memory. When they are racing on swap cache,
> > >> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
> > >> took about 1475 us.
> > >
> > > And what effect does this patch have upon the measured time? And upon
> > > overall runtime?
> >
> > And the patch will cause increased lock contention, please test with
> > more processes and perhaps HDD swap device too.
>
> Hi Ying,
>
> Thank you for your suggestion.
> It may indeed cause some lock contention, as mentioned by Kairui before.
>
> If so, is it recommended?
> ---
> unsigned char swap_map, mapcount, hascache;
> ...
> /* Return raw data of the si->swap_map[offset] */
> swap_map = __swap_map(si, entry);
> mapcount = swap_map & ~SWAP_HAS_CACHE;
> if (!mapcount && swap_slot_cache_enabled)
> ...
> hascache = swap_map & SWAP_HAS_CACHE;
> /* Could judge that it's being added to swap cache with high probability */
> if (mapcount && hascache)
> 	goto skip_alloc;
> ...
> ---
> In doing so, there is no additional use of locks.
>

Hmm so is this a lockless check now? Ummmm... Could someone with more
expertise in the Linux kernel memory model double check that this is
even a valid state we're observing here? Looks like we're performing
an unguarded, unsynchronized, non-atomic read with the possibility of
concurrent write - is there a chance we might see partial/invalid
results?

Could you also test with zswap enabled (and perhaps with zswap
shrinker enabled)?
Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> writes:

> On Mon, Apr 08, 2024 at 01:27:04PM -0700, Andrew Morton wrote:
>> On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>>
>> > Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
>> > Create 1G anon mmap and set it to shared, and has two processes
>> > randomly access the shared memory. When they are racing on swap cache,
>> > on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
>> > took about 1475 us.
>>
>> And what effect does this patch have upon the measured time? And upon
>> overall runtime?
>
> Hi Andrew,
>
> When share memory between two or more processes has swapped and pagefault now,
> it would readahead swap and call __read_swap_cache_async().
> If one of the processes calls swapcache_prepare() and finds that the cache
> has been EXIST(another process added), it will folio_put on the basis of the
> alloc_pages_mpol() that has been called, and then try filemap_get_folio() again.
>
> I think the page alloc in this process is wasteful.
> when the memory pressure is large, alloc_pages_mpol() will be time-consuming,
> so the purpose of my patch is to judge whether the page has cache before page alloc,
> then skip page alloc and retry filemap_get_folio() to save the time of the function.

Please prove your theory with data, better with benchmark score.

--
Best Regards,
Huang, Ying

>>
>> > So skip page allocation if SWAP_HAS_CACHE was set, just
>> > schedule_timeout_uninterruptible and continue to acquire page
>> > via filemap_get_folio() from swap cache, to speedup
>> > __read_swap_cache_async.
Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> writes:

> On Tue, Apr 09, 2024 at 09:07:29AM +0800, Huang, Ying wrote:
>> Andrew Morton <akpm@linux-foundation.org> writes:
>>
>> > On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>> >
>> >> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
>> >> Create 1G anon mmap and set it to shared, and has two processes
>> >> randomly access the shared memory. When they are racing on swap cache,
>> >> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
>> >> took about 1475 us.
>> >
>> > And what effect does this patch have upon the measured time? And upon
>> > overall runtime?
>>
>> And the patch will cause increased lock contention, please test with
>> more processes and perhaps HDD swap device too.
>
> Hi Ying,
>
> Thank you for your suggestion.
> It may indeed cause some lock contention, as mentioned by Kairui before.
>
> If so, is it recommended?
> ---
> unsigned char swap_map, mapcount, hascache;
> ...
> /* Return raw data of the si->swap_map[offset] */
> swap_map = __swap_map(si, entry);
> mapcount = swap_map & ~SWAP_HAS_CACHE;
> if (!mapcount && swap_slot_cache_enabled)
> ...
> hascache = swap_map & SWAP_HAS_CACHE;
> /* Could judge that it's being added to swap cache with high probability */
> if (mapcount && hascache)
> 	goto skip_alloc;
> ...
> ---
> In doing so, there is no additional use of locks.

Yes. This can remove the lock-contention. But, you need to prove that
it's necessary in the first place.

--
Best Regards,
Huang, Ying

>> >> So skip page allocation if SWAP_HAS_CACHE was set, just
>> >> schedule_timeout_uninterruptible and continue to acquire page
>> >> via filemap_get_folio() from swap cache, to speedup
>> >> __read_swap_cache_async.
Nhat Pham <nphamcs@gmail.com> writes:

> On Tue, Apr 9, 2024 at 7:57 AM Zhaoyu Liu
> <liuzhaoyu.zackary@bytedance.com> wrote:
>>
>> On Tue, Apr 09, 2024 at 09:07:29AM +0800, Huang, Ying wrote:
>> > Andrew Morton <akpm@linux-foundation.org> writes:
>> >
>> > > On Mon, 8 Apr 2024 20:14:39 +0800 Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> wrote:
>> > >
>> > >> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
>> > >> Create 1G anon mmap and set it to shared, and has two processes
>> > >> randomly access the shared memory. When they are racing on swap cache,
>> > >> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
>> > >> took about 1475 us.
>> > >
>> > > And what effect does this patch have upon the measured time? And upon
>> > > overall runtime?
>> >
>> > And the patch will cause increased lock contention, please test with
>> > more processes and perhaps HDD swap device too.
>>
>> Hi Ying,
>>
>> Thank you for your suggestion.
>> It may indeed cause some lock contention, as mentioned by Kairui before.
>>
>> If so, is it recommended?
>> ---
>> unsigned char swap_map, mapcount, hascache;
>> ...
>> /* Return raw data of the si->swap_map[offset] */
>> swap_map = __swap_map(si, entry);
>> mapcount = swap_map & ~SWAP_HAS_CACHE;
>> if (!mapcount && swap_slot_cache_enabled)
>> ...
>> hascache = swap_map & SWAP_HAS_CACHE;
>> /* Could judge that it's being added to swap cache with high probability */
>> if (mapcount && hascache)
>> 	goto skip_alloc;
>> ...
>> ---
>> In doing so, there is no additional use of locks.
>>
>
> Hmm so is this a lockless check now? Ummmm... Could someone with more
> expertise in the Linux kernel memory model double check that this is
> even a valid state we're observing here? Looks like we're performing
> an unguarded, unsynchronized, non-atomic read with the possibility of
> concurrent write - is there a chance we might see partial/invalid
> results?
>
> Could you also test with zswap enabled (and perhaps with zswap
> shrinker enabled)?

READ_ONCE() will save us from partial/invalid results.

--
Best Regards,
Huang, Ying
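As a rough illustration of the READ_ONCE() remark, the sketch below shows a userspace approximation of a marked single-byte load. It is an assumption-laden analogue, not the kernel macro: `read_byte_once()` and `slot_has_cache_lockless()` are hypothetical names, and the volatile cast only imitates the core of what READ_ONCE() does for a scalar. The value returned can still be stale a moment later, so it is only suitable as a hint that is re-validated afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SWAP_HAS_CACHE 0x40 /* flag bit in a swap_map[] byte, as in include/linux/swap.h */

/*
 * Userspace approximation of READ_ONCE() for one byte: the volatile
 * access asks the compiler to emit exactly one load and not to repeat
 * or fuse it with other accesses.
 */
static inline unsigned char read_byte_once(const unsigned char *p)
{
	return *(const volatile unsigned char *)p;
}

/*
 * Hypothetical lockless peek at the flag. The byte is loaded once and
 * then tested, so the flag test always works on one consistent snapshot.
 */
static bool slot_has_cache_lockless(const unsigned char *swap_map, size_t offset)
{
	unsigned char raw;

	assert(swap_map != NULL);
	raw = read_byte_once(&swap_map[offset]);
	return (raw & SWAP_HAS_CACHE) != 0;
}
```

A caller would treat a true result as "probably being added to the swap cache" and fall back to the normal lookup path, which performs the authoritative check.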
On Mon, 2024-04-08 at 20:14 +0800, Zhaoyu Liu wrote:
> Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
> Create 1G anon mmap and set it to shared, and has two processes
> randomly access the shared memory. When they are racing on swap cache,
> on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
> took about 1475 us.
>
> So skip page allocation if SWAP_HAS_CACHE was set, just
> schedule_timeout_uninterruptible and continue to acquire page
> via filemap_get_folio() from swap cache, to speedup
> __read_swap_cache_async.
>
> Signed-off-by: Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com>
> ---
> Changes in v2:
>   - Fix the patch format and rebase to latest linux-next.
> ---
>  include/linux/swap.h |  6 ++++++
>  mm/swap_state.c      | 10 ++++++++++
>  mm/swapfile.c        | 15 +++++++++++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..a374070e05a7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -492,6 +492,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern int __swap_count(swp_entry_t entry);
>  extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
> +extern bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry);
>  struct swap_info_struct *swp_swap_info(swp_entry_t entry);
>  struct backing_dev_info;
>  extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
> @@ -583,6 +584,11 @@ static inline int swp_swapcount(swp_entry_t entry)
>  	return 0;
>  }
>
> +static inline bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  static inline swp_entry_t folio_alloc_swap(struct folio *folio)
>  {
>  	swp_entry_t entry;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 642c30d8376c..f117fbf18b59 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -462,6 +462,15 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		if (!swap_swapcount(si, entry) && swap_slot_cache_enabled)
>  			goto fail_put_swap;
>
> +		/*
> +		 * Skipping page allocation if SWAP_HAS_CACHE was set,
> +		 * just schedule_timeout_uninterruptible and continue to
> +		 * acquire page via filemap_get_folio() from swap cache,
> +		 * to speedup __read_swap_cache_async.
> +		 */
> +		if (swap_has_cache(si, entry))
> +			goto skip_alloc;
> +

I think most of the cases where a page already exists will be caught
by filemap_get_folio(). The cases caught by this extra check should be
when we have races between page cache update and the read async,
which may not be that often. So please verify with benchmark
that this extra check with its own overhead would buy us anything.

Tim

>  		/*
>  		 * Get a new folio to read into from swap.  Allocate it now,
>  		 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> @@ -483,6 +492,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		if (err != -EEXIST)
>  			goto fail_put_swap;
>
> +skip_alloc:
>  		/*
>  		 * Protect against a recursive call to __read_swap_cache_async()
>  		 * on the same entry waiting forever here because SWAP_HAS_CACHE
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3ee8957a46e6..b016ebc43b0d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1511,6 +1511,21 @@ int swp_swapcount(swp_entry_t entry)
>  	return count;
>  }
>
> +/*
> + * Verify that a swap entry has been tagged with SWAP_HAS_CACHE
> + */
> +bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +	pgoff_t offset = swp_offset(entry);
> +	struct swap_cluster_info *ci;
> +	bool has_cache;
> +
> +	ci = lock_cluster_or_swap_info(si, offset);
> +	has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
> +	unlock_cluster_or_swap_info(si, ci);
> +	return has_cache;
> +}
> +
>  static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
>  					 swp_entry_t entry,
>  					 unsigned int nr_pages)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..a374070e05a7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -492,6 +492,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
 extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
+extern bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry);
 struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -583,6 +584,11 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
+static inline bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry)
+{
+	return false;
+}
+
 static inline swp_entry_t folio_alloc_swap(struct folio *folio)
 {
 	swp_entry_t entry;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 642c30d8376c..f117fbf18b59 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -462,6 +462,15 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (!swap_swapcount(si, entry) && swap_slot_cache_enabled)
 			goto fail_put_swap;
 
+		/*
+		 * Skipping page allocation if SWAP_HAS_CACHE was set,
+		 * just schedule_timeout_uninterruptible and continue to
+		 * acquire page via filemap_get_folio() from swap cache,
+		 * to speedup __read_swap_cache_async.
+		 */
+		if (swap_has_cache(si, entry))
+			goto skip_alloc;
+
 		/*
 		 * Get a new folio to read into from swap.  Allocate it now,
 		 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
@@ -483,6 +492,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (err != -EEXIST)
 			goto fail_put_swap;
 
+skip_alloc:
 		/*
 		 * Protect against a recursive call to __read_swap_cache_async()
 		 * on the same entry waiting forever here because SWAP_HAS_CACHE
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3ee8957a46e6..b016ebc43b0d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1511,6 +1511,21 @@ int swp_swapcount(swp_entry_t entry)
 	return count;
 }
 
+/*
+ * Verify that a swap entry has been tagged with SWAP_HAS_CACHE
+ */
+bool swap_has_cache(struct swap_info_struct *si, swp_entry_t entry)
+{
+	pgoff_t offset = swp_offset(entry);
+	struct swap_cluster_info *ci;
+	bool has_cache;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+	unlock_cluster_or_swap_info(si, ci);
+	return has_cache;
+}
+
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 					 swp_entry_t entry,
 					 unsigned int nr_pages)
Based on qemu arm64 - latest kernel + 100M memory + 1024M swapfile.
Create 1G anon mmap and set it to shared, and has two processes
randomly access the shared memory. When they are racing on swap cache,
on average, each "alloc_pages_mpol + swapcache_prepare + folio_put"
took about 1475 us.

So skip page allocation if SWAP_HAS_CACHE was set, just
schedule_timeout_uninterruptible and continue to acquire page
via filemap_get_folio() from swap cache, to speedup
__read_swap_cache_async.

Signed-off-by: Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com>
---
Changes in v2:
  - Fix the patch format and rebase to latest linux-next.
---
 include/linux/swap.h |  6 ++++++
 mm/swap_state.c      | 10 ++++++++++
 mm/swapfile.c        | 15 +++++++++++++++
 3 files changed, 31 insertions(+)