mm: add maybe_lru_add_drain() that only drains when threshold is exceeded

Message ID 20241218115604.7e56bedb@fangorn (mailing list archive)
State New
Series mm: add maybe_lru_add_drain() that only drains when threshold is exceeded

Commit Message

Rik van Riel Dec. 18, 2024, 4:56 p.m. UTC
The lru_add_drain() call in zap_page_range_single() always takes some locks,
and will drain the buffers even when there is only a single page pending.

We probably don't need to do that, since we already deal fine with zap_page_range
encountering pages that are still in the buffers of other CPUs.

On an AMD Milan CPU, throughput of the will-it-scale tlb_flush2_threads test with
36 threads (one for each core) increases from 526k to 730k loops per second.

The overhead in this case was on the lruvec locks, taking the lock to flush
a single page. There may be other spots where this variant could be appropriate.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/swap.h |  1 +
 mm/memory.c          |  2 +-
 mm/swap.c            | 18 ++++++++++++++++++
 mm/swap_state.c      |  2 +-
 4 files changed, 21 insertions(+), 2 deletions(-)

Comments

Shakeel Butt Dec. 18, 2024, 8:20 p.m. UTC | #1
On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
[...]
>  
> +static bool should_lru_add_drain(void)
> +{
> +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);

You will need either a local_lock or preempt_disable to access the per
cpu batches.
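
For illustration, a minimal sketch of that suggestion, pinning the task with
get_cpu_ptr()/put_cpu_ptr() while the counts are read (sketch only, not part of
the posted patch; the result stays advisory either way, since the task can
migrate right after the check):

static bool should_lru_add_drain(void)
{
	struct cpu_fbatches *fbatches = get_cpu_ptr(&cpu_fbatches);
	int pending = folio_batch_count(&fbatches->lru_add);

	pending += folio_batch_count(&fbatches->lru_deactivate);
	pending += folio_batch_count(&fbatches->lru_deactivate_file);
	pending += folio_batch_count(&fbatches->lru_lazyfree);

	/* Re-enable preemption; the counts are only used as a hint. */
	put_cpu_ptr(&cpu_fbatches);

	/* Don't bother draining unless we have several pages pending. */
	return pending > SWAP_CLUSTER_MAX;
}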

> +	int pending = folio_batch_count(&fbatches->lru_add);
> +	pending += folio_batch_count(&fbatches->lru_deactivate);
> +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> +
> +	/* Don't bother draining unless we have several pages pending. */
> +	return pending > SWAP_CLUSTER_MAX;
> +}
> +
> +void maybe_lru_add_drain(void)

Later it might also make sense to see if other users of lru_add_drain()
should be fine with maybe_lru_add_drain() as well.

> +{
> +	if (should_lru_add_drain())
> +		lru_add_drain();
> +}
> +
Rik van Riel Dec. 19, 2024, 3:13 a.m. UTC | #2
On Wed, 2024-12-18 at 12:20 -0800, Shakeel Butt wrote:
> On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
> [...]
> >  
> > +static bool should_lru_add_drain(void)
> > +{
> > +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
> 
> You will need either a local_lock or preempt_disable to access the
> per cpu batches.
> 
Why is that? Can the per-cpu batches disappear on us
while we're trying to access them, without that
local_lock or preempt_disable?

I'm not trying to protect against accidentally reading
the wrong CPU's numbers, since we could be preempted
and migrated to another CPU immediately after returning
from should_lru_add_drain, but do want to keep things
safe against other potential issues.


> > +	int pending = folio_batch_count(&fbatches->lru_add);
> > +	pending += folio_batch_count(&fbatches->lru_deactivate);
> > +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> > +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> > +
> > +	/* Don't bother draining unless we have several pages pending. */
> > +	return pending > SWAP_CLUSTER_MAX;
> > +}
> > +
> > +void maybe_lru_add_drain(void)
> 
> Later it might also make sense to see if other users of lru_add_drain()
> should be fine with maybe_lru_add_drain() as well.

Agreed. I think there are a few other users where this
could make sense, including munmap.
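
As a rough illustration of that follow-up (not part of this series, and
assuming the munmap path still drains via a bare lru_add_drain() call in the
caller that sets up the mmu_gather, e.g. unmap_region()), the conversion would
mirror the zap_page_range_single() hunk:

-	lru_add_drain();
+	maybe_lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);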
David Hildenbrand Dec. 19, 2024, 1:47 p.m. UTC | #3
On 18.12.24 17:56, Rik van Riel wrote:
> The lru_add_drain() call in zap_page_range_single() always takes some locks,
> and will drain the buffers even when there is only a single page pending.
> 
> We probably don't need to do that, since we already deal fine with zap_page_range
> encountering pages that are still in the buffers of other CPUs.
> 
> On an AMD Milan CPU, throughput of the will-it-scale tlb_flush2_threads test with
> 36 threads (one for each core) increases from 526k to 730k loops per second.
> 
> The overhead in this case was on the lruvec locks, taking the lock to flush
> a single page. There may be other spots where this variant could be appropriate.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>   include/linux/swap.h |  1 +
>   mm/memory.c          |  2 +-
>   mm/swap.c            | 18 ++++++++++++++++++
>   mm/swap_state.c      |  2 +-
>   4 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index dd5ac833150d..a2f06317bd4b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -391,6 +391,7 @@ static inline void lru_cache_enable(void)
>   }
>   
>   extern void lru_cache_disable(void);
> +extern void maybe_lru_add_drain(void);
>   extern void lru_add_drain(void);
>   extern void lru_add_drain_cpu(int cpu);
>   extern void lru_add_drain_cpu_zone(struct zone *zone);
> diff --git a/mm/memory.c b/mm/memory.c
> index 2635f7bceab5..1767c65b93ad 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1919,7 +1919,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
>   	struct mmu_notifier_range range;
>   	struct mmu_gather tlb;
>   
> -	lru_add_drain();
> +	maybe_lru_add_drain();
>   	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>   				address, end);
>   	hugetlb_zap_begin(vma, &range.start, &range.end);
> diff --git a/mm/swap.c b/mm/swap.c
> index 9caf6b017cf0..001664a652ff 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -777,6 +777,24 @@ void lru_add_drain(void)
>   	mlock_drain_local();
>   }
>   
> +static bool should_lru_add_drain(void)
> +{
> +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
> +	int pending = folio_batch_count(&fbatches->lru_add);
> +	pending += folio_batch_count(&fbatches->lru_deactivate);
> +	pending += folio_batch_count(&fbatches->lru_deactivate_file);
> +	pending += folio_batch_count(&fbatches->lru_lazyfree);
> +
> +	/* Don't bother draining unless we have several pages pending. */
> +	return pending > SWAP_CLUSTER_MAX;
> +}
> +
> +void maybe_lru_add_drain(void)
> +{
> +	if (should_lru_add_drain())
> +		lru_add_drain();
> +}
> +
>   /*
>    * It's called from per-cpu workqueue context in SMP case so
>    * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3a0cf965f32b..1ae4cd7b041e 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>   	struct folio_batch folios;
>   	unsigned int refs[PAGEVEC_SIZE];
>   
> -	lru_add_drain();
> +	maybe_lru_add_drain();

I'm wondering about the reason+effect of this existing call.

Seems to date back to the beginning of git.

Likely it doesn't make sense to have effectively-free pages in the 
LRU+mlock cache. But then, this only considers the local CPU LRU/mlock 
caches ... hmmm

So .... do we need this at all? :)
Rik van Riel Dec. 19, 2024, 2:11 p.m. UTC | #4
On Thu, 2024-12-19 at 14:47 +0100, David Hildenbrand wrote:
> 
> > +++ b/mm/swap_state.c
> > @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >   	struct folio_batch folios;
> >   	unsigned int refs[PAGEVEC_SIZE];
> >   
> > -	lru_add_drain();
> > +	maybe_lru_add_drain();
> 
> I'm wondering about the reason+effect of this existing call.
> 
> Seems to date back to the beginning of git.
> 
> Likely it doesn't make sense to have effectively-free pages in the 
> LRU+mlock cache. But then, this only considers the local CPU
> LRU/mlock 
> caches ... hmmm
> 
> So .... do we need this at all? :)
> 
That is a very good question.

I think we need to free those pending pages at
some point. They can't accumulate there forever.
However, I am not sure where those points should
be. 

I can think of a few considerations:
1) We should consider approximate LRU ordering,
   and move pages onto the LRU every once in a
   while.
2) When we are trying to free memory, we should
   maybe ensure not too many pages are in these
   temporary buffers?
3) For lock batching reasons, we do not want to
   drain these buffers too frequently.

My patch takes a small step in the direction of
more batching, but maybe we can take a larger one?
Shakeel Butt Dec. 19, 2024, 5 p.m. UTC | #5
On Wed, Dec 18, 2024 at 10:13:30PM -0500, Rik van Riel wrote:
> On Wed, 2024-12-18 at 12:20 -0800, Shakeel Butt wrote:
> > On Wed, Dec 18, 2024 at 11:56:04AM -0500, Rik van Riel wrote:
> > [...]
> > >  
> > > +static bool should_lru_add_drain(void)
> > > +{
> > > +	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
> > 
> > You will need either a local_lock or preempt_disable to access the
> > per cpu batches.
> > 
> Why is that? Can the per-cpu batches disappear on us
> while we're trying to access them, without that
> local_lock or preempt_disable?
> 
> I'm not trying to protect against accidentally reading
> the wrong CPU's numbers, since we could be preempted
> and migrated to another CPU immediately after returning
> from should_lru_add_drain, but do want to keep things
> safe against other potential issues.
> 

I was thinking of task migration across CPUs but it can still happen
while zap is ongoing. Yeah we might need to reevaluate if we really need
to drain here or do draining periodically.
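
One way to sidestep the preemption question entirely (again just a sketch, not
something posted in this thread, and assuming folio_batch_count() is a plain
read of the batch's nr field) would be to read each counter with
this_cpu_read() and treat the sum purely as a hint:

static bool should_lru_add_drain(void)
{
	/* Each this_cpu_read() is a single per-CPU access; no preemption games. */
	int pending = this_cpu_read(cpu_fbatches.lru_add.nr);

	pending += this_cpu_read(cpu_fbatches.lru_deactivate.nr);
	pending += this_cpu_read(cpu_fbatches.lru_deactivate_file.nr);
	pending += this_cpu_read(cpu_fbatches.lru_lazyfree.nr);

	/* The reads may span a migration; that is fine for a heuristic. */
	return pending > SWAP_CLUSTER_MAX;
}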
David Hildenbrand Dec. 19, 2024, 5:23 p.m. UTC | #6
On 19.12.24 15:11, Rik van Riel wrote:
> On Thu, 2024-12-19 at 14:47 +0100, David Hildenbrand wrote:
>>
>>> +++ b/mm/swap_state.c
>>> @@ -317,7 +317,7 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>>>    	struct folio_batch folios;
>>>    	unsigned int refs[PAGEVEC_SIZE];
>>>    
>>> -	lru_add_drain();
>>> +	maybe_lru_add_drain();
>>
>> I'm wondering about the reason+effect of this existing call.
>>
>> Seems to date back to the beginning of git.
>>
>> Likely it doesn't make sense to have effectively-free pages in the
>> LRU+mlock cache. But then, this only considers the local CPU
>> LRU/mlock
>> caches ... hmmm
>>
>> So .... do we need this at all? :)
>>
> That is a very good question.
> 
> I think we need to free those pending pages at
> some point. They can't accumulate there forever.
> However, I am not sure where those points should
> be.

The number of entries is limited, so the cache will regularly be drained under 
normal system operation. Only if we manage to not place any pages on the 
local LRU cache, then they would in fact get stranded there until 
someone does a lru_add_drain_all(), or we perform another operation that 
calls lru_add_drain() on this CPU.

I would assume memory reclaim would drain as well, and I see some calls 
in vmscan.c


Interestingly, we only use the LRU cache for small folios. Even order-1 
folios never strand there. So when freeing a bunch of large folios,
draining might not make sense at all.

> 
> I can think of a few considerations:
> 1) We should consider approximate LRU ordering,
>     and move pages onto the LRU every once in a
>     while.
> 2) When we are trying to free memory, we should
>     maybe ensure not too many pages are in these
>     temporary buffers?

folio_batch_add() drains if folio_batch_space() returns 0 (no slots left).

Apparently we have PAGEVEC_SIZE slots, which is 31 ... 31 * PAGE_SIZE 
stranded there.
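
Rough arithmetic, assuming 4 KiB base pages: 31 slots means at most 31 * 4 KiB,
i.e. about 124 KiB of page memory held back per batch per CPU (cpu_fbatches
holds several such batches, so a few hundred KiB per CPU in the worst case).
The drain-on-full behaviour comes from the callers of folio_batch_add(), which
roughly follow this pattern (sketch, not a verbatim copy of mm/swap.c):

	/*
	 * folio_batch_add() returns the number of free slots left after
	 * adding, so a zero return means the batch is full and has to be
	 * flushed to the LRU.
	 */
	if (!folio_batch_add(fbatch, folio))
		folio_batch_move_lru(fbatch, move_fn);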

> 3) For lock batching reasons, we do not want to
>     drain these buffers too frequently.

Yes. I know that we drain in places before we perform some action that 
requires these folios to be marked as LRU, for example, if we want to 
isolate them.

> 
> My patch takes a small step in the direction of
> more batching, but maybe we can take a larger one?

Yes, this "drain on the page freeing path" looks a bit odd to me, but 
maybe there is a good reason why it's been around for decades ... :)
Rik van Riel Dec. 19, 2024, 5:50 p.m. UTC | #7
On Thu, 2024-12-19 at 18:23 +0100, David Hildenbrand wrote:
> On 19.12.24 15:11, Rik van Riel wrote:
> > 
> > I think we need to free those pending pages at
> > some point. They can't accumulate there forever.
> > However, I am not sure where those points should
> > be.
> 
> The number of entries is limited, 

...
> folio_batch_add() drains if folio_batch_space() returns 0 (no slots
> left).
> 
> Apparently we have PAGEVEC_SIZE slots, which is 31 ... 31 * PAGE_SIZE
> stranded there.

Sure enough! That was the piece I was missing last night.

I guess we can get away with just draining these in the
reclaim and compaction paths, and not touch them the
rest of the time?

That could be a nice improvement in some situations.

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index dd5ac833150d..a2f06317bd4b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -391,6 +391,7 @@  static inline void lru_cache_enable(void)
 }
 
 extern void lru_cache_disable(void);
+extern void maybe_lru_add_drain(void);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 2635f7bceab5..1767c65b93ad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1919,7 +1919,7 @@  void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
 
-	lru_add_drain();
+	maybe_lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, end);
 	hugetlb_zap_begin(vma, &range.start, &range.end);
diff --git a/mm/swap.c b/mm/swap.c
index 9caf6b017cf0..001664a652ff 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -777,6 +777,24 @@  void lru_add_drain(void)
 	mlock_drain_local();
 }
 
+static bool should_lru_add_drain(void)
+{
+	struct cpu_fbatches *fbatches = this_cpu_ptr(&cpu_fbatches);
+	int pending = folio_batch_count(&fbatches->lru_add);
+	pending += folio_batch_count(&fbatches->lru_deactivate);
+	pending += folio_batch_count(&fbatches->lru_deactivate_file);
+	pending += folio_batch_count(&fbatches->lru_lazyfree);
+
+	/* Don't bother draining unless we have several pages pending. */
+	return pending > SWAP_CLUSTER_MAX;
+}
+
+void maybe_lru_add_drain(void)
+{
+	if (should_lru_add_drain())
+		lru_add_drain();
+}
+
 /*
  * It's called from per-cpu workqueue context in SMP case so
  * lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3a0cf965f32b..1ae4cd7b041e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -317,7 +317,7 @@  void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	struct folio_batch folios;
 	unsigned int refs[PAGEVEC_SIZE];
 
-	lru_add_drain();
+	maybe_lru_add_drain();
 	folio_batch_init(&folios);
 	for (int i = 0; i < nr; i++) {
 		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));