
[v1,02/25] mm/swap: Don't abuse the seqcount latching API

Message ID 20200519214547.352050-3-a.darwish@linutronix.de (mailing list archive)
State New, archived
Headers show
Series seqlock: Extend seqcount API with associated locks

Commit Message

Ahmed S. Darwish May 19, 2020, 9:45 p.m. UTC
Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.

This was done through a seqcount latch, which is an abuse of its
semantics:

  1. Seqcount latching should be used for the purpose of switching
     between two storage places with sequence protection to allow
     interruptible, preemptible writer sections. The optimization
     mechanism has absolutely nothing to do with that.

  2. The used raw_write_seqcount_latch() has two smp write memory
     barriers to always ensure one consistent storage place out of the
     two storage places available. This extra smp_wmb() is redundant for
     the optimization use case.

Besides the API abuse, the semantics of a latch sequence counter were
force-fitted into the optimization. What was actually meant is to track
generations of LRU draining operations, where "current lru draining
generation = x" implies that all generations 0 < n <= x are already
*scheduled* for draining.

Remove the conceptually-inappropriate seqcount latch usage and manually
implement the optimization using a counter and SMP memory barriers.

Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 10 deletions(-)
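
For context, point 1 above refers to the canonical latch usage pattern. A
minimal sketch of that intended use, modelled on the raw_write_seqcount_latch()
example in include/linux/seqlock.h (the latch_u64 type and the
latch_write()/latch_read() helpers are illustrative names, not existing
kernel API):

/* Two copies of the data, so readers always find one consistent copy
 * even while a (possibly preempted) writer is mid-update. The seqcount
 * is assumed to be initialized with SEQCNT_ZERO(). */
struct latch_u64 {
	seqcount_t	seq;
	u64		data[2];
};

static void latch_write(struct latch_u64 *l, u64 val)
{
	raw_write_seqcount_latch(&l->seq);	/* switch readers to data[1] */
	l->data[0] = val;
	raw_write_seqcount_latch(&l->seq);	/* switch readers back to data[0] */
	l->data[1] = val;
}

static u64 latch_read(struct latch_u64 *l)
{
	unsigned int seq, idx;
	u64 val;

	do {
		seq = raw_read_seqcount_latch(&l->seq);
		idx = seq & 0x1;		/* pick the copy not being written */
		val = READ_ONCE(l->data[idx]);
	} while (read_seqcount_retry(&l->seq, seq));

	return val;
}

The lru_add_drain_all() code removed below uses none of this: there is only
one storage place, and the sequence counter serves purely as a generation
count.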

Comments

Hillf Danton May 20, 2020, 11:24 a.m. UTC | #1
On Tue, 19 May 2020 23:45:24 +0200 "Ahmed S. Darwish" wrote:
> 
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
> 
> This was done through a seqcount latch, which is an abuse of its
> semantics:
> 
>   1. Seqcount latching should be used for the purpose of switching
>      between two storage places with sequence protection to allow
>      interruptible, preemptible writer sections. The optimization
>      mechanism has absolutely nothing to do with that.
> 
>   2. The used raw_write_seqcount_latch() has two smp write memory
>      barriers to always ensure one consistent storage place out of the
>      two storage places available. This extra smp_wmb() is redundant for
>      the optimization use case.
> 
> Besides the API abuse, the semantics of a latch sequence counter were
> force-fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
> 
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.
> 
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> ---
>  mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {
> -	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> -	static DEFINE_MUTEX(lock);
> +	/*
> +	 * lru_drain_gen - Current generation of pages that could be in vectors
> +	 *
> +	 * (A) Definition: lru_drain_gen = x implies that all generations
> +	 *     0 < n <= x are already scheduled for draining.
> +	 *
> +	 * This is an optimization for the highly-contended use case where a
> +	 * user space workload keeps constantly generating a flow of pages
> +	 * for each CPU.
> +	 */
> +	static unsigned int lru_drain_gen;
>  	static struct cpumask has_work;
> -	int cpu, seq;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
> 
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
> 
> -	seq = raw_read_seqcount_latch(&seqcount);
> +	/*
> +	 * (B) Cache the LRU draining generation number
> +	 *
> +	 * smp_rmb() ensures that the counter is loaded before the mutex is
> +	 * taken. It pairs with the smp_wmb() inside the mutex critical section
> +	 * at (D).
> +	 */
> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();
> 
>  	mutex_lock(&lock);
> 
>  	/*
> -	 * Piggyback on drain started and finished while we waited for lock:
> -	 * all pages pended at the time of our enter were drained from vectors.
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>  	 */
> -	if (__read_seqcount_retry(&seqcount, seq))
> +	if (unlikely(this_gen != lru_drain_gen))
>  		goto done;
> 
> -	raw_write_seqcount_latch(&seqcount);
> +	/*
> +	 * (D) Increment generation number
> +	 *
> +	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> +	 * section.
> +	 *
> +	 * This pairing must be done here, before the for_each_online_cpu loop
> +	 * below which drains the page vectors.
> +	 *
> +	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
> +	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
> +	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
> +	 * along, adds some pages to its per-cpu vectors, then calls
> +	 * lru_add_drain_all().
> +	 *
> +	 * If the paired smp_wmb() below is done at any later step, e.g. after
> +	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
> +	 * its added pages.
> +	 */
> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();

Writer is inside mutex.

> 
>  	cpumask_clear(&has_work);
> -
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> 
> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>  {
>  	lru_add_drain();
>  }
> -#endif
> +#endif /* CONFIG_SMP */
> 
>  /**
>   * release_pages - batched put_page()
> -- 
> 2.20.1


The tiny diff below, inspired by this work, tries to trylock the mutex by
switching the sequence counter's readers and writer across the mutex
boundary.

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -714,10 +714,12 @@ static void lru_add_drain_per_cpu(struct
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
+	static unsigned int lru_drain_gen;
 	static DEFINE_MUTEX(lock);
 	static struct cpumask has_work;
-	int cpu, seq;
+	int cpu;
+	int loop = 0;
+	unsigned int seq;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -726,18 +728,12 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
-	seq = raw_read_seqcount_latch(&seqcount);
+	seq = lru_drain_gen++;
+	smp_mb();
 
-	mutex_lock(&lock);
-
-	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
-	 */
-	if (__read_seqcount_retry(&seqcount, seq))
-		goto done;
-
-	raw_write_seqcount_latch(&seqcount);
+	if (!mutex_trylock(&lock))
+		return;
+more_work:
 
 	cpumask_clear(&has_work);
 
@@ -759,7 +755,11 @@ void lru_add_drain_all(void)
 	for_each_cpu(cpu, &has_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
-done:
+	smp_mb();
+	if (++seq != lru_drain_gen && loop++ < 128)
+		goto more_work;
+
+	smp_mb();
 	mutex_unlock(&lock);
 }
 #else
--
Konstantin Khlebnikov May 20, 2020, 12:22 p.m. UTC | #2
On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
> 
> This was done through a seqcount latch, which is an abuse of its
> semantics:
> 
>    1. Seqcount latching should be used for the purpose of switching
>       between two storage places with sequence protection to allow
>       interruptible, preemptible writer sections. The optimization
>       mechanism has absolutely nothing to do with that.
> 
>    2. The used raw_write_seqcount_latch() has two smp write memory
>       barriers to always ensure one consistent storage place out of the
>       two storage places available. This extra smp_wmb() is redundant for
>       the optimization use case.
> 
> Besides the API abuse, the semantics of a latch sequence counter were
> force-fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
> 
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.

Well, I thought it fits perfectly =)

Maybe it's worth adding helpers with appropriate semantics?
This is a pretty common pattern.
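
A rough sketch of what such a helper could look like (purely hypothetical:
the drain_gen type and function names are invented here, and the barriers
simply mirror the (B)/(D) pairing of the patch quoted below):

struct drain_gen {
	unsigned int gen;
};

/* Snapshot the generation before blocking on the serializing lock. */
static inline unsigned int drain_gen_snapshot(struct drain_gen *dg)
{
	unsigned int snap = READ_ONCE(dg->gen);

	smp_rmb();	/* load the counter before taking the lock */
	return snap;
}

/* With the lock held: did another pass already cover our request? */
static inline bool drain_gen_superseded(struct drain_gen *dg, unsigned int snap)
{
	return snap != dg->gen;
}

/* With the lock held: open a new generation before doing the work. */
static inline void drain_gen_advance(struct drain_gen *dg)
{
	WRITE_ONCE(dg->gen, dg->gen + 1);
	smp_wmb();	/* publish the new generation before the work's stores */
}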

> 
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
> ---
>   mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>    */
>   void lru_add_drain_all(void)
>   {
> -	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> -	static DEFINE_MUTEX(lock);
> +	/*
> +	 * lru_drain_gen - Current generation of pages that could be in vectors
> +	 *
> +	 * (A) Definition: lru_drain_gen = x implies that all generations
> +	 *     0 < n <= x are already scheduled for draining.
> +	 *
> +	 * This is an optimization for the highly-contended use case where a
> +	 * user space workload keeps constantly generating a flow of pages
> +	 * for each CPU.
> +	 */
> +	static unsigned int lru_drain_gen;
>   	static struct cpumask has_work;
> -	int cpu, seq;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>   
>   	/*
>   	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>   	if (WARN_ON(!mm_percpu_wq))
>   		return;
>   
> -	seq = raw_read_seqcount_latch(&seqcount);
> +	/*
> +	 * (B) Cache the LRU draining generation number
> +	 *
> +	 * smp_rmb() ensures that the counter is loaded before the mutex is
> +	 * taken. It pairs with the smp_wmb() inside the mutex critical section
> +	 * at (D).
> +	 */
> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();
>   
>   	mutex_lock(&lock);
>   
>   	/*
> -	 * Piggyback on drain started and finished while we waited for lock:
> -	 * all pages pended at the time of our enter were drained from vectors.
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>   	 */
> -	if (__read_seqcount_retry(&seqcount, seq))
> +	if (unlikely(this_gen != lru_drain_gen))
>   		goto done;
>   
> -	raw_write_seqcount_latch(&seqcount);
> +	/*
> +	 * (D) Increment generation number
> +	 *
> +	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> +	 * section.
> +	 *
> +	 * This pairing must be done here, before the for_each_online_cpu loop
> +	 * below which drains the page vectors.
> +	 *
> +	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
> +	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
> +	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
> +	 * along, adds some pages to its per-cpu vectors, then calls
> +	 * lru_add_drain_all().
> +	 *
> +	 * If the paired smp_wmb() below is done at any later step, e.g. after
> +	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
> +	 * its added pages.
> +	 */
> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();
>   
>   	cpumask_clear(&has_work);
> -
>   	for_each_online_cpu(cpu) {
>   		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>   
> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>   {
>   	lru_add_drain();
>   }
> -#endif
> +#endif /* CONFIG_SMP */
>   
>   /**
>    * release_pages - batched put_page()
>
Peter Zijlstra May 20, 2020, 1:05 p.m. UTC | #3
On Wed, May 20, 2020 at 03:22:15PM +0300, Konstantin Khlebnikov wrote:
> On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> > implemented an optimization mechanism to exit the to-be-started LRU
> > drain operation (name it A) if another drain operation *started and
> > finished* while (A) was blocked on the LRU draining mutex.

That commit is horrible...

> Well, I thought it fits perfectly =)
> 
> Maybe it's worth to add helpers with appropriate semantics?
> This is pretty common pattern.

Where's more sites?

> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >   	if (WARN_ON(!mm_percpu_wq))
> >   		return;
> > -	seq = raw_read_seqcount_latch(&seqcount);
> >   	mutex_lock(&lock);
> >   	/*
> > -	 * Piggyback on drain started and finished while we waited for lock:
> > -	 * all pages pended at the time of our enter were drained from vectors.
> >   	 */
> > -	if (__read_seqcount_retry(&seqcount, seq))
> >   		goto done;

Since there is no ordering in raw_read_seqcount_latch(), and
mutex_lock() is an ACQUIRE, there's no guarantee the read actually
happens before the mutex is acquired.

> > -	raw_write_seqcount_latch(&seqcount);
> >   	cpumask_clear(&has_work);
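
To illustrate the point with a hypothetical sketch (not from the thread):

	/* As written in eef1a429f234: */
	seq = raw_read_seqcount_latch(&seqcount);
	mutex_lock(&lock);				/* ACQUIRE */

	/*
	 * One execution order the ACQUIRE permits: the earlier load is
	 * satisfied only after the lock is taken, since ACQUIRE forbids
	 * later accesses from moving up, not earlier ones from moving down.
	 */
	mutex_lock(&lock);				/* ACQUIRE */
	seq = raw_read_seqcount_latch(&seqcount);	/* now reads the current value */
	if (__read_seqcount_retry(&seqcount, seq))	/* can no longer observe a drain */
		goto done;				/* that finished while we were blocked */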
Peter Zijlstra May 22, 2020, 2:57 p.m. UTC | #4
On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>   */
>  void lru_add_drain_all(void)
>  {

> +	static unsigned int lru_drain_gen;
>  	static struct cpumask has_work;
> +	static DEFINE_MUTEX(lock);
> +	int cpu, this_gen;
>  
>  	/*
>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>  	if (WARN_ON(!mm_percpu_wq))
>  		return;
>  

> +	this_gen = READ_ONCE(lru_drain_gen);
> +	smp_rmb();

	this_gen = smp_load_acquire(&lru_drain_gen);
>  
>  	mutex_lock(&lock);
>  
>  	/*
> +	 * (C) Exit the draining operation if a newer generation, from another
> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>  	 */
> +	if (unlikely(this_gen != lru_drain_gen))
>  		goto done;
>  

> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> +	smp_wmb();

You can leave this smp_wmb() out and rely on the smp_mb() implied by
queue_work_on()'s test_and_set_bit().

>  	cpumask_clear(&has_work);
> -
>  	for_each_online_cpu(cpu) {
>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>  

While you're here, do:

	s/cpumask_set_cpu/__&/

> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>  {
>  	lru_add_drain();
>  }
> -#endif
> +#endif /* CONFIG_SMP */
>  
>  /**
>   * release_pages - batched put_page()
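
Taken together, the two suggestions would turn the loop into something like
the sketch below (the per-CPU pagevec checks are condensed into a
hypothetical cpu_needs_drain() helper for brevity):

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		if (cpu_needs_drain(cpu)) {	/* stand-in for the pagevec checks */
			INIT_WORK(work, lru_add_drain_per_cpu);
			/* queue_work_on() implies smp_mb() via test_and_set_bit() */
			queue_work_on(cpu, mm_percpu_wq, work);
			/* non-atomic variant: &has_work is only written under the mutex */
			__cpumask_set_cpu(cpu, &has_work);
		}
	}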
Sebastian Andrzej Siewior May 22, 2020, 3:17 p.m. UTC | #5
On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >  	if (WARN_ON(!mm_percpu_wq))
> >  		return;
> >  
> 
> > +	this_gen = READ_ONCE(lru_drain_gen);
> > +	smp_rmb();
> 
> 	this_gen = smp_load_acquire(&lru_drain_gen);
> >  
> >  	mutex_lock(&lock);
> >  
> >  	/*
> > +	 * (C) Exit the draining operation if a newer generation, from another
> > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> >  	 */
> > +	if (unlikely(this_gen != lru_drain_gen))
> >  		goto done;
> >  
> 
> > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > +	smp_wmb();
> 
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().

This is to avoid smp_store_release()?

Sebastian
Peter Zijlstra May 22, 2020, 4:23 p.m. UTC | #6
On Fri, May 22, 2020 at 05:17:05PM +0200, Sebastian A. Siewior wrote:
> On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > >  	if (WARN_ON(!mm_percpu_wq))
> > >  		return;
> > >  
> > 
> > > +	this_gen = READ_ONCE(lru_drain_gen);
> > > +	smp_rmb();
> > 
> > 	this_gen = smp_load_acquire(&lru_drain_gen);
> > >  
> > >  	mutex_lock(&lock);
> > >  
> > >  	/*
> > > +	 * (C) Exit the draining operation if a newer generation, from another
> > > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> > >  	 */
> > > +	if (unlikely(this_gen != lru_drain_gen))
> > >  		goto done;
> > >  
> > 
> > > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > +	smp_wmb();
> > 
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
> 
> This is to avoid smp_store_release() ?

store_release would have the barrier on the other end. If you read the
comments (which I so helpfully cut out) you'll see it wants to order against
later stores, not earlier ones.
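
For reference, a side-by-side sketch of the two variants being contrasted
(illustrative only):

	/* RELEASE: orders *earlier* accesses before the store */
	smp_store_release(&lru_drain_gen, lru_drain_gen + 1);

	/* What the comment at (D) asks for: the store ordered before the
	 * *later* stores that queue the per-CPU drain work */
	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
	smp_wmb();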
Ahmed S. Darwish May 25, 2020, 3:24 p.m. UTC | #7
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >   */
> >  void lru_add_drain_all(void)
> >  {
>

Re-adding cut-out comment for context:

	/*
	 * lru_drain_gen - Current generation of pages that could be in vectors
	 *
	 * (A) Definition: lru_drain_gen = x implies that all generations
	 *     0 < n <= x are already scheduled for draining.
	 *
	 * This is an optimization for the highly-contended use case where a
	 * user space workload keeps constantly generating a flow of pages
	 * for each CPU.
	 */
> > +	static unsigned int lru_drain_gen;
> >  	static struct cpumask has_work;
> > +	static DEFINE_MUTEX(lock);
> > +	int cpu, this_gen;
> >
> >  	/*
> >  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> >  	if (WARN_ON(!mm_percpu_wq))
> >  		return;
> >
>

Re-adding cut-out comment for context:

	/*
	 * (B) Cache the LRU draining generation number
	 *
	 * smp_rmb() ensures that the counter is loaded before the mutex is
	 * taken. It pairs with the smp_wmb() inside the mutex critical section
	 * at (D).
	 */
> > +	this_gen = READ_ONCE(lru_drain_gen);
> > +	smp_rmb();
>
> 	this_gen = smp_load_acquire(&lru_drain_gen);

ACK. will do.

> >
> >  	mutex_lock(&lock);
> >
> >  	/*
> > +	 * (C) Exit the draining operation if a newer generation, from another
> > +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
> >  	 */
> > +	if (unlikely(this_gen != lru_drain_gen))
> >  		goto done;
> >
>

Re-adding cut-out comment for context:

	/*
	 * (D) Increment generation number
	 *
	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
	 * section.
	 *
	 * This pairing must be done here, before the for_each_online_cpu loop
	 * below which drains the page vectors.
	 *
	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
	 * along, adds some pages to its per-cpu vectors, then calls
	 * lru_add_drain_all().
	 *
	 * If the paired smp_wmb() below is done at any later step, e.g. after
	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
	 * its added pages.
	 */
> > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > +	smp_wmb();
>
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>

Won't this be too implicit?

Isn't it possible that, over the years, the queue_work_on() implementation
changes and the test_and_set_bit()/smp_mb() gets removed?

If that happens, this commit will get *silently* broken and the local
CPU pages won't be drained.

> >  	cpumask_clear(&has_work);
> > -
> >  	for_each_online_cpu(cpu) {
> >  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
>
> While you're here, do:
>
> 	s/cpumask_set_cpu/__&/
>

ACK.

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
Peter Zijlstra May 25, 2020, 3:45 p.m. UTC | #8
On Mon, May 25, 2020 at 05:24:01PM +0200, Ahmed S. Darwish wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:

> > > +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > +	smp_wmb();
> >
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
> >
> 
> Won't this be too implicit?
> 
> Isn't it possible that, over the years, queue_work_on() impementation
> changes and the test_and_set_bit()/smp_mb() gets removed?
> 
> If that happens, this commit will get *silently* broken and the local
> CPU pages won't be drained.

Add a comment to queue_work_on() that points here? That way people are
aware.
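
Such a cross-reference might look roughly like the sketch below, next to the
test_and_set_bit() in queue_work_on() (kernel/workqueue.c); the surrounding
function body is abbreviated from memory and the comment wording is
hypothetical:

bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
{
	bool ret = false;
	unsigned long flags;

	local_irq_save(flags);

	/*
	 * The full barrier implied by test_and_set_bit() is relied upon by
	 * lru_add_drain_all() in mm/swap.c to order its generation counter
	 * update against queueing the per-CPU drain work.
	 */
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_work(cpu, wq, work);
		ret = true;
	}

	local_irq_restore(flags);
	return ret;
}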
John Ogness May 25, 2020, 4:10 p.m. UTC | #9
Hi,

This optimization is broken. The main concern here: Is it possible that
lru_add_drain_all() _would_ have drained pagevec X, but then aborted
because another lru_add_drain_all() is underway and that other task will
_not_ drain pagevec X? I claim the answer is yes!

My suggested changes are inline below.

I attached a litmus test to verify it.

On 2020-05-22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
>> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>   */
>>  void lru_add_drain_all(void)
>>  {
>
>> +	static unsigned int lru_drain_gen;
>>  	static struct cpumask has_work;
>> +	static DEFINE_MUTEX(lock);
>> +	int cpu, this_gen;
>>  
>>  	/*
>>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
>> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>>  	if (WARN_ON(!mm_percpu_wq))
>>  		return;
>>  

An smp_mb() is needed here.

	/*
	 * Guarantee the pagevec counter stores visible by
	 * this CPU are visible to other CPUs before loading
	 * the current drain generation.
	 */
	smp_mb();

>> +	this_gen = READ_ONCE(lru_drain_gen);
>> +	smp_rmb();
>
> 	this_gen = smp_load_acquire(&lru_drain_gen);
>>  
>>  	mutex_lock(&lock);
>>  
>>  	/*
>> +	 * (C) Exit the draining operation if a newer generation, from another
>> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>>  	 */
>> +	if (unlikely(this_gen != lru_drain_gen))
>>  		goto done;
>>  
>
>> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
>> +	smp_wmb();

Instead of smp_wmb(), this needs to be a full memory barrier.

	/*
	 * Guarantee the new drain generation is stored before
	 * loading the pagevec counters.
	 */
	smp_mb();

> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>
>>  	cpumask_clear(&has_work);
>> -
>>  	for_each_online_cpu(cpu) {
>>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>>  
>
> While you're here, do:
>
> 	s/cpumask_set_cpu/__&/
>
>> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>>  {
>>  	lru_add_drain();
>>  }
>> -#endif
>> +#endif /* CONFIG_SMP */
>>  
>>  /**
>>   * release_pages - batched put_page()

For the litmus test:

1:rx=0             (P1 did not see the pagevec counter)
2:rx=1             (P2 _would_ have seen the pagevec counter)
2:ry1=0 /\ 2:ry2=1 (P2 aborted due to optimization)

Changing the smp_mb() back to smp_wmb() in P1 and removing the smp_mb()
in P2 represents this patch. And it shows that sometimes P2 will abort
even though it would have drained the pagevec and P1 did not drain the
pagevec.

This is ugly as hell. And there may be other memory barrier types to make
it pretty. But as is, memory barriers are missing.

John Ogness
C lru_add_drain_all

(*
 * x is a pagevec counter
 * y is @lru_drain_gen
 * z is @lock
 *)

{ }

P0(int *x)
{
	// mark pagevec for draining
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y, int *z)
{
	int rx;
	int rz;

	// mutex_lock(&lock);
	rz = cmpxchg_acquire(z, 0, 1);
	if (rz == 0) {

		// WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
		WRITE_ONCE(*y, 1);

		// guarantee lru_drain_gen store before loading pagevec
		smp_mb();

		// if (pagevec_count(...))
		rx = READ_ONCE(*x);

		// mutex_unlock(&lock);
		rz = cmpxchg_release(z, 1, 2);
	}
}

P2(int *x, int *y, int *z)
{
	int rx;
	int ry1;
	int ry2;
	int rz;

	// the pagevec counter as visible now to this CPU
	rx = READ_ONCE(*x);

	// guarantee pagevec store before loading lru_drain_gen
	smp_mb();

	// this_gen = READ_ONCE(lru_drain_gen); smp_rmb();
	ry1 = smp_load_acquire(y);

	// mutex_lock(&lock) - acquired after P1
	rz = cmpxchg_acquire(z, 2, 3);
	if (rz == 2) {

		// if (unlikely(this_gen != lru_drain_gen))
		ry2 = READ_ONCE(*y);
	}
}

locations [x; y; z]
exists (1:rx=0 /\ 2:rx=1 /\ 2:ry1=0 /\ 2:ry2=1)
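
(For reference, such a litmus test can be checked against the kernel memory
model with the herd7 tool, e.g. "herd7 -conf linux-kernel.cfg
lru_add_drain_all.litmus" run from tools/memory-model/; the file name is
assumed here.)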

Patch

diff --git a/mm/swap.c b/mm/swap.c
index bf9a79fed62d..d6910eeed43d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -713,10 +713,20 @@  static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 void lru_add_drain_all(void)
 {
-	static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
-	static DEFINE_MUTEX(lock);
+	/*
+	 * lru_drain_gen - Current generation of pages that could be in vectors
+	 *
+	 * (A) Definition: lru_drain_gen = x implies that all generations
+	 *     0 < n <= x are already scheduled for draining.
+	 *
+	 * This is an optimization for the highly-contended use case where a
+	 * user space workload keeps constantly generating a flow of pages
+	 * for each CPU.
+	 */
+	static unsigned int lru_drain_gen;
 	static struct cpumask has_work;
-	int cpu, seq;
+	static DEFINE_MUTEX(lock);
+	int cpu, this_gen;
 
 	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -725,21 +735,48 @@  void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;
 
-	seq = raw_read_seqcount_latch(&seqcount);
+	/*
+	 * (B) Cache the LRU draining generation number
+	 *
+	 * smp_rmb() ensures that the counter is loaded before the mutex is
+	 * taken. It pairs with the smp_wmb() inside the mutex critical section
+	 * at (D).
+	 */
+	this_gen = READ_ONCE(lru_drain_gen);
+	smp_rmb();
 
 	mutex_lock(&lock);
 
 	/*
-	 * Piggyback on drain started and finished while we waited for lock:
-	 * all pages pended at the time of our enter were drained from vectors.
+	 * (C) Exit the draining operation if a newer generation, from another
+	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
 	 */
-	if (__read_seqcount_retry(&seqcount, seq))
+	if (unlikely(this_gen != lru_drain_gen))
 		goto done;
 
-	raw_write_seqcount_latch(&seqcount);
+	/*
+	 * (D) Increment generation number
+	 *
+	 * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
+	 * section.
+	 *
+	 * This pairing must be done here, before the for_each_online_cpu loop
+	 * below which drains the page vectors.
+	 *
+	 * Let x, y, and z represent some system CPU numbers, where x < y < z.
+	 * Assume CPU #z is in the middle of the for_each_online_cpu loop
+	 * below and has already reached CPU #y's per-cpu data. CPU #x comes
+	 * along, adds some pages to its per-cpu vectors, then calls
+	 * lru_add_drain_all().
+	 *
+	 * If the paired smp_wmb() below is done at any later step, e.g. after
+	 * the loop, CPU #x will just exit at (C) and miss flushing out all of
+	 * its added pages.
+	 */
+	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
+	smp_wmb();
 
 	cpumask_clear(&has_work);
-
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
@@ -766,7 +803,7 @@  void lru_add_drain_all(void)
 {
 	lru_add_drain();
 }
-#endif
+#endif /* CONFIG_SMP */
 
 /**
  * release_pages - batched put_page()