Message ID | 20241218115604.7e56bedb@fangorn (mailing list archive) |
---|---|
State | New |
Series | mm: add maybe_lru_add_drain() that only drains when threshold is exceeded |
On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
[...]
>
> +static bool should_lru_add_drain(void)
> +{
> +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);

You will need either a local_lock or preempt_disable to access the per
cpu batches.

> +	int pending = folio_batch_count(&fbatches->lru_add);
> +	pending += folio_batch_count(&fbatches->lru_deactivate);
> +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> +
> +	/* Don't bother draining unless we have several pages pending. */
> +	return pending > SWAP_CLUSTER_MAX;
> +}
> +
> +void maybe_lru_add_drain(void)

Later it might also make sense to see if other users of lru_add_drain()
should be fine with maybe_lru_add_drain() as well.

> +{
> +	if (should_lru_add_drain())
> +		lru_add_drain();
> +}
> +
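As a rough illustration of the suggestion above (a sketch, not code from the
posted patch), the counter read could be wrapped in get_cpu_ptr()/put_cpu_ptr(),
which disable preemption for the duration of the per-CPU access:

/*
 * Illustrative variant of should_lru_add_drain() only: read the per-CPU
 * counters with preemption disabled. get_cpu_ptr()/put_cpu_ptr() pin the
 * task to the current CPU for the access; the result is still only a
 * heuristic, since the task may migrate right after this returns.
 */
static bool should_lru_add_drain(void)
{
	struct cpu_fbatches *fbatches = get_cpu_ptr(&cpu_fbatches);
	int pending = folio_batch_count(&fbatches->lru_add);

	pending += folio_batch_count(&fbatches->lru_deactivate);
	pending += folio_batch_count(&fbatches->lru_deactivate_file);
	pending += folio_batch_count(&fbatches->lru_lazyfree);
	put_cpu_ptr(&cpu_fbatches);

	/* Don't bother draining unless we have several pages pending. */
	return pending > SWAP_CLUSTER_MAX;
}

Whether pinning buys anything in practice is exactly what the follow-up
messages below discuss, since the task can migrate immediately after the
check returns either way.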
On Wed, 2024-12-18 at 12:20 -0800, Shakeel Butt wrote:
> On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
> [...]
> >
> > +static bool should_lru_add_drain(void)
> > +{
> > +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
>
> You will need either a local_lock or preempt_disable to access the per
> cpu batches.
>
Why is that? Can the per-cpu batches disappear on us
while we're trying to access them, without that
local_lock or preempt_disable?

I'm not trying to protect against accidentally reading
the wrong CPU's numbers, since we could be preempted
and migrated to another CPU immediately after returning
from should_lru_add_drain, but do want to keep things
safe against other potential issues.

> > +	int pending = folio_batch_count(&fbatches->lru_add);
> > +	pending += folio_batch_count(&fbatches->lru_deactivate);
> > +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> > +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> > +
> > +	/* Don't bother draining unless we have several pages pending. */
> > +	return pending > SWAP_CLUSTER_MAX;
> > +}
> > +
> > +void maybe_lru_add_drain(void)
>
> Later it might also make sense to see if other users of lru_add_drain()
> should be fine with maybe_lru_add_drain() as well.

Agreed. I think there are a few other users where this
could make sense, including munmap.
On 18.12.24 17:56, Rik van Riel wrote:
> The lru_add_drain() call in zap_page_range_single() always takes some locks,
> and will drain the buffers even when there is only a single page pending.
>
> We probably don't need to do that, since we already deal fine with zap_page_range
> encountering pages that are still in the buffers of other CPUs.
>
> On an AMD Milan CPU, will-it-scale the tlb_flush2_threads test performance with
> 36 threads (one for each core) increases from 526k to 730k loops per second.
>
> The overhead in this case was on the lruvec locks, taking the lock to flush
> a single page. There may be other spots where this variant could be appropriate.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/swap.h |  1 +
>  mm/memory.c          |  2 +-
>  mm/swap.c            | 18 ++++++++++++++++++
>  mm/swap_state.c      |  2 +-
>  4 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index dd5ac833150d..a2f06317bd4b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -391,6 +391,7 @@ static inline void lru_cache_enable(void)
>  }
>
>  extern void lru_cache_disable(void);
> +extern void maybe_lru_add_drain(void);
>  extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/memory.c b/mm/memory.c
> index 2635f7bceab5..1767c65b93ad 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1919,7 +1919,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
>  	struct mmu_notifier_range range;
>  	struct mmu_gather tlb;
>
> -	lru_add_drain();
> +	maybe_lru_add_drain();
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>  				address, end);
>  	hugetlb_zap_begin(vma, &range.start, &range.end);
> diff --git a/mm/swap.c b/mm/swap.c
> index 9caf6b017cf0..001664a652ff 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -777,6 +777,24 @@ void lru_add_drain(void)
>  	mlock_drain_local();
>  }
>
> +static bool should_lru_add_drain(void)
> +{
> +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
> +	int pending = folio_batch_count(&fbatches->lru_add);
> +	pending += folio_batch_count(&fbatches->lru_deactivate);
> +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> +
> +	/* Don't bother draining unless we have several pages pending. */
> +	return pending > SWAP_CLUSTER_MAX;
> +}
> +
> +void maybe_lru_add_drain(void)
> +{
> +	if (should_lru_add_drain())
> +		lru_add_drain();
> +}
> +
>  /*
>   * It's called from per-cpu workqueue context in SMP case so
>   * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3a0cf965f32b..1ae4cd7b041e 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  	struct folio_batch folios;
>  	unsigned int refs[PAGEVEC_SIZE];
>
> -	lru_add_drain();
> +	maybe_lru_add_drain();

I'm wondering about the reason+effect of this existing call.

Seems to date back to the beginning of git.

Likely it doesn't make sense to have effectively-free pages in the
LRU+mlock cache. But then, this only considers the local CPU LRU/mlock
caches ... hmmm

So .... do we need this at all? :)
On Thu, 2024-12-19 at 14:47 +0100, David Hildenbrand wrote:
>
> > +++ b/mm/swap_state.c
> > @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >  	struct folio_batch folios;
> >  	unsigned int refs[PAGEVEC_SIZE];
> >
> > -	lru_add_drain();
> > +	maybe_lru_add_drain();
>
> I'm wondering about the reason+effect of this existing call.
>
> Seems to date back to the beginning of git.
>
> Likely it doesn't make sense to have effectively-free pages in the
> LRU+mlock cache. But then, this only considers the local CPU LRU/mlock
> caches ... hmmm
>
> So .... do we need this at all? :)
>
That is a very good question.

I think we need to free those pending pages at some point.
They can't accumulate there forever. However, I am not sure
where those points should be.

I can think of a few considerations:
1) We should consider approximate LRU ordering, and move pages
   onto the LRU every once in a while.
2) When we are trying to free memory, we should maybe ensure not
   too many pages are in these temporary buffers?
3) For lock batching reasons, we do not want to drain these
   buffers too frequently.

My patch takes a small step in the direction of more batching,
but maybe we can take a larger one?
On Wed, Dec 18, 2024 at 10:13:30PM -0500, Rik van Riel wrote:
> On Wed, 2024-12-18 at 12:20 -0800, Shakeel Butt wrote:
> > On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
> > [...]
> > >
> > > +static bool should_lru_add_drain(void)
> > > +{
> > > +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
> >
> > You will need either a local_lock or preempt_disable to access the per
> > cpu batches.
> >
> Why is that? Can the per-cpu batches disappear on us
> while we're trying to access them, without that
> local_lock or preempt_disable?
>
> I'm not trying to protect against accidentally reading
> the wrong CPU's numbers, since we could be preempted
> and migrated to another CPU immediately after returning
> from should_lru_add_drain, but do want to keep things
> safe against other potential issues.
>
I was thinking of task migration across CPUs but it can still happen
while zap is ongoing. Yeah we might need to reevaluate if we really
need to drain here or do draining periodically.
On 19.12.24 15:11, Rik van Riel wrote:
> On Thu, 2024-12-19 at 14:47 +0100, David Hildenbrand wrote:
>>
>>> +++ b/mm/swap_state.c
>>> @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>>>  	struct folio_batch folios;
>>>  	unsigned int refs[PAGEVEC_SIZE];
>>>
>>> -	lru_add_drain();
>>> +	maybe_lru_add_drain();
>>
>> I'm wondering about the reason+effect of this existing call.
>>
>> Seems to date back to the beginning of git.
>>
>> Likely it doesn't make sense to have effectively-free pages in the
>> LRU+mlock cache. But then, this only considers the local CPU LRU/mlock
>> caches ... hmmm
>>
>> So .... do we need this at all? :)
>>
> That is a very good question.
>
> I think we need to free those pending pages at
> some point. They can't accumulate there forever.
> However, I am not sure where those points should
> be.

The number of entries are limited, so it will regularly be drained
under normal system operation. Only if we manage to not place any pages
on the local LRU cache, then they would in fact get stranded there
until someone does a lru_add_drain_all(), or we perform another
operation that calls lru_add_drain() on this CPU.

I would assume memory reclaim would drain as well, and I see some calls
in vmscan.c

Interestingly, we only use the LRU cache for small folios. Even order-1
folios never strand there. So when freeing a bunch of large folios,
draining might not make sense at all.

>
> I can think of a few considerations:
> 1) We should consider approximate LRU ordering,
> and move pages onto the LRU every once in a
> while.
> 2) When we are trying to free memory, we should
> maybe ensure not too many pages are in these
> temporary buffers?

folio_batch_add() drains if folio_batch_space() returns 0 (no slots
left).

Apparently we have PAGEVEC_SIZE slots, which is 31 ... 31 * PAGE_SIZE
stranded there.

> 3) For lock batching reasons, we do not want to
> drain these buffers too frequently.

Yes. I know that we drain in places before we perform some action that
requires these folios to be marked as LRU, for example, if we want to
isolate them.

>
> My patch takes a small step in the direction of
> more batching, but maybe we can take a larger one?

Yes, this "drain on the page freeing path" looks a bit odd to me, but
maybe there is a good reason why it's been around for decades ... :)
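The implicit cap David mentions can be sketched as follows. This is a
hypothetical illustration rather than the actual mm/swap.c code:
folio_batch_add() returns the number of free slots remaining after the add,
so a return value of 0 means all PAGEVEC_SIZE (31) slots are used and the
batch must be flushed. flush_batch_to_lru() is a made-up name standing in
for the real drain/move helper.

/*
 * Hypothetical sketch of the per-CPU LRU cache batching pattern:
 * add the folio, and if the batch just ran out of slots, push the
 * whole batch onto the LRU in one locked operation.
 */
static void lru_cache_add_sketch(struct folio_batch *fbatch,
				 struct folio *folio)
{
	if (!folio_batch_add(fbatch, folio))	/* no slots left? */
		flush_batch_to_lru(fbatch);	/* hypothetical drain helper */
}

This bound is why, at worst, only a few dozen small folios can sit stranded
in each CPU's cache between drains.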
On Thu, 2024-12-19 at 18:23 +0100, David Hildenbrand wrote:
> On 19.12.24 15:11, Rik van Riel wrote:
> >
> > I think we need to free those pending pages at
> > some point. They can't accumulate there forever.
> > However, I am not sure where those points should
> > be.
>
> The number of entries are limited, ...
>
> folio_batch_add() drains if folio_batch_space() returns 0 (no slots
> left).
>
> Apparently we have PAGEVEC_SIZE slots, which is 31 ... 31 * PAGE_SIZE
> stranded there.

Sure enough! That was the piece I was missing last night.

I guess we can get away with just draining these in the reclaim and
compaction paths, and not touch them the rest of the time?

That could be a nice improvement in some situations.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index dd5ac833150d..a2f06317bd4b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -391,6 +391,7 @@ static inline void lru_cache_enable(void)
 }
 
 extern void lru_cache_disable(void);
+extern void maybe_lru_add_drain(void);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 2635f7bceab5..1767c65b93ad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1919,7 +1919,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
 
-	lru_add_drain();
+	maybe_lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, end);
 	hugetlb_zap_begin(vma, &range.start, &range.end);
diff --git a/mm/swap.c b/mm/swap.c
index 9caf6b017cf0..001664a652ff 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -777,6 +777,24 @@ void lru_add_drain(void)
 	mlock_drain_local();
 }
 
+static bool should_lru_add_drain(void)
+{
+	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
+	int pending = folio_batch_count(&fbatches->lru_add);
+	pending += folio_batch_count(&fbatches->lru_deactivate);
+	pending += folio_batch_count(&fbatches->lru_deactivate_file);
+	pending += folio_batch_count(&fbatches->lru_lazyfree);
+
+	/* Don't bother draining unless we have several pages pending. */
+	return pending > SWAP_CLUSTER_MAX;
+}
+
+void maybe_lru_add_drain(void)
+{
+	if (should_lru_add_drain())
+		lru_add_drain();
+}
+
 /*
  * It's called from per-cpu workqueue context in SMP case so
  * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3a0cf965f32b..1ae4cd7b041e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	struct folio_batch folios;
 	unsigned int refs[PAGEVEC_SIZE];
 
-	lru_add_drain();
+	maybe_lru_add_drain();
 	folio_batch_init(&folios);
 	for (int i = 0; i < nr; i++) {
 		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
The lru_add_drain() call in zap_page_range_single() always takes some locks,
and will drain the buffers even when there is only a single page pending.

We probably don't need to do that, since we already deal fine with
zap_page_range encountering pages that are still in the buffers of other CPUs.

On an AMD Milan CPU, will-it-scale the tlb_flush2_threads test performance with
36 threads (one for each core) increases from 526k to 730k loops per second.

The overhead in this case was on the lruvec locks, taking the lock to flush
a single page. There may be other spots where this variant could be appropriate.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/swap.h |  1 +
 mm/memory.c          |  2 +-
 mm/swap.c            | 18 ++++++++++++++++++
 mm/swap_state.c      |  2 +-
 4 files changed, 21 insertions(+), 2 deletions(-)