
[2/6] mm/swap: Introduce alternative per-cpu LRU cache locking

Message ID 20210921161323.607817-3-nsaenzju@redhat.com (mailing list archive)
State New
Series mm: Remote LRU per-cpu pagevec cache/per-cpu page list drain support

Commit Message

Nicolas Saenz Julienne Sept. 21, 2021, 4:13 p.m. UTC
'struct lru_pvecs' and 'struct lru_rotate' per-CPU pagevecs are
protected using local locks. While good for performance, this doesn't
allow for remote access to these structures. CPUs requiring system-wide
LRU cache drains have to work around this by scheduling drain work on
all CPUs. Some setups, such as systems with NOHZ_FULL CPUs, aren't well
suited to this, as their isolated CPUs can't handle interruptions of
any sort.

To mitigate this, introduce an alternative locking scheme using
spinlocks that will permit remotely accessing these per-CPU pagevecs.
It's disabled by default, with no functional change to regular users,
and enabled through the 'remote_pcpu_cache_access' static key. Upcoming
patches will make use of this static key.

The naming of the static key is left vague on purpose as it is planned
to also enable a similar locking setup to access mm/page_alloc.c's
per-cpu page lists remotely.
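
For illustration only (the actual enabling mechanism belongs to the
later patches in the series; the init hook below is hypothetical and
assumes <linux/tick.h> for tick_nohz_full_enabled()), the key could be
flipped early during boot on NOHZ_FULL setups along these lines:

/*
 * Hypothetical sketch, not part of this patch: enable remote per-cpu
 * cache access whenever nohz_full CPUs are configured.
 */
static int __init remote_pcpu_cache_access_init(void)
{
	if (tick_nohz_full_enabled())
		static_branch_enable(&remote_pcpu_cache_access);
	return 0;
}
early_initcall(remote_pcpu_cache_access_init);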

This is based on previous work by Thomas Gleixner, Anna-Maria Gleixner,
and Sebastian Andrzej Siewior[1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/20190424111208.24459-3-bigeasy@linutronix.de/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 mm/internal.h |   2 +
 mm/swap.c     | 152 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 128 insertions(+), 26 deletions(-)

Comments

Peter Zijlstra Sept. 21, 2021, 10:03 p.m. UTC | #1
On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> +{
> +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> +		/* Avoid migration between this_cpu_ptr() and spin_lock() */
> +		migrate_disable();
> +		spin_lock(this_cpu_ptr(&locks->spin));
> +	} else {
> +		local_lock(&locks->local);
> +	}
> +}

> +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> +{
> +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> +		spin_unlock(this_cpu_ptr(&locks->spin));
> +		migrate_enable();
> +	} else {
> +		local_unlock(&locks->local);
> +	}
> +}

*why* use migrate_disable(), that's horrible!
Nicolas Saenz Julienne Sept. 22, 2021, 8:47 a.m. UTC | #2
On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > +{
> > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > +		/* Avoid migration between this_cpu_ptr() and spin_lock() */
> > +		migrate_disable();
> > +		spin_lock(this_cpu_ptr(&locks->spin));
> > +	} else {
> > +		local_lock(&locks->local);
> > +	}
> > +}
> 
> > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > +{
> > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > +		spin_unlock(this_cpu_ptr(&locks->spin));
> > +		migrate_enable();
> > +	} else {
> > +		local_unlock(&locks->local);
> > +	}
> > +}
> 
> *why* use migrate_disable(), that's horrible!

I was trying to be mindful of RT. They don't appreciate people taking spinlocks
just after having disabled preemption.

I think getting local_lock(&locks->local) is my only option then. But it adds
an extra redundant spinlock in the RT+NOHZ_FULL case.
Sebastian Andrzej Siewior Sept. 22, 2021, 9:20 a.m. UTC | #3
On 2021-09-22 10:47:07 [+0200], nsaenzju@redhat.com wrote:
> > *why* use migrate_disable(), that's horrible!
> 
> I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> just after having disabled preemption.
> 
> I think getting local_lock(&locks->local) is my only option then. But it adds
> an extra redundant spinlock in the RT+NOHZ_FULL case.

spin_lock() does not disable preemption on PREEMPT_RT. You don't
disable preemption on purpose, or did I miss that?

Sebastian
Nicolas Saenz Julienne Sept. 22, 2021, 9:50 a.m. UTC | #4
On Wed, 2021-09-22 at 11:20 +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-22 10:47:07 [+0200], nsaenzju@redhat.com wrote:
> > > *why* use migrate_disable(), that's horrible!
> > 
> > I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> > just after having disabled preemption.
> > 
> > I think getting local_lock(&locks->local) is my only option then. But it adds
> > an extra redundant spinlock in the RT+NOHZ_FULL case.
> 
> spin_lock() does not disable preemption on PREEMPT_RT. You don't
> disable preemption on purpose, or did I miss that?

Sorry, my message wasn't clear. Adding code for context:

+ static inline void lru_cache_lock(struct lru_cache_locks *locks)
+ {
+ 	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ 		/* Avoid migration between this_cpu_ptr() and spin_lock() */
+ 		migrate_disable();

IIUC PeterZ would've preferred that I disable preemption here. What I meant
to explain is that, given the spinlock below, I chose migrate_disable() over
preempt_disable() to cater for RT.

+ 		spin_lock(this_cpu_ptr(&locks->spin));
+ 	} else {


So, to make both worlds happy, I think the only option left is using the
local_lock (at the expense of extra overhead in the RT+NOHZ_FULL case):

+ 	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+ 		/* Local lock avoids migration between this_cpu_ptr() and spin_lock() */
+ 		local_lock(&locks->local);
+ 		spin_lock(this_cpu_ptr(&locks->spin));
+ 	} else {
Peter Zijlstra Sept. 22, 2021, 11:37 a.m. UTC | #5
On Wed, Sep 22, 2021 at 10:47:07AM +0200, nsaenzju@redhat.com wrote:
> On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> > On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > > +{
> > > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > +		/* Avoid migration between this_cpu_ptr() and spin_lock() */
> > > +		migrate_disable();
> > > +		spin_lock(this_cpu_ptr(&locks->spin));
> > > +	} else {
> > > +		local_lock(&locks->local);
> > > +	}
> > > +}
> > 
> > > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > > +{
> > > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > +		spin_unlock(this_cpu_ptr(&locks->spin));
> > > +		migrate_enable();
> > > +	} else {
> > > +		local_unlock(&locks->local);
> > > +	}
> > > +}
> > 
> > *why* use migrate_disable(), that's horrible!
> 
> I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> just after having disabled preemption.
> 
> I think getting local_lock(&locks->local) is my only option then. But it adds
> an extra redundant spinlock in the RT+NOHZ_FULL case.

That doesn't make it less horrible. The fundamental problem you seem to
have is that you have to do the this_cpu thing multiple times.

If instead you make sure to only ever do the per-cpu deref *once* and
then make sure you use the same lru_rotate.pvec as you used
lru_rotate.locks, it all works out much nicer.

Then you can write things like:

	struct lru_rotate *lr = raw_cpu_ptr(&lru_rotate);

	frob_lock(&lr->locks);
	frob_pvec(&lr->pvec);
	frob_unlock(&lr->locks);

and it all no longer matters if you got this or that CPU's instance.

After all, you no longer rely on per-cpu ness for serialization but the
lock.
Nicolas Saenz Julienne Sept. 22, 2021, 11:43 a.m. UTC | #6
On Wed, 2021-09-22 at 13:37 +0200, Peter Zijlstra wrote:
> On Wed, Sep 22, 2021 at 10:47:07AM +0200, nsaenzju@redhat.com wrote:
> > On Wed, 2021-09-22 at 00:03 +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 21, 2021 at 06:13:20PM +0200, Nicolas Saenz Julienne wrote:
> > > > +static inline void lru_cache_lock(struct lru_cache_locks *locks)
> > > > +{
> > > > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > > +		/* Avoid migration between this_cpu_ptr() and spin_lock() */
> > > > +		migrate_disable();
> > > > +		spin_lock(this_cpu_ptr(&locks->spin));
> > > > +	} else {
> > > > +		local_lock(&locks->local);
> > > > +	}
> > > > +}
> > > 
> > > > +static inline void lru_cache_unlock(struct lru_cache_locks *locks)
> > > > +{
> > > > +	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
> > > > +		spin_unlock(this_cpu_ptr(&locks->spin));
> > > > +		migrate_enable();
> > > > +	} else {
> > > > +		local_unlock(&locks->local);
> > > > +	}
> > > > +}
> > > 
> > > *why* use migrate_disable(), that's horrible!
> > 
> > I was trying to be mindful of RT. They don't appreciate people taking spinlocks
> > just after having disabled preemption.
> > 
> > I think getting local_lock(&locks->local) is my only option then. But it adds
> > an extra redundant spinlock in the RT+NOHZ_FULL case.
> 
> That doesn't make it less horrible. The fundamental problem you seem to
> have is that you have to do the this_cpu thing multiple times.
> 
> If instead you make sure to only ever do the per-cpu deref *once* and
> then make sure you use the same lru_rotate.pvec as you used
> lru_rotate.locks, it all works out much nicer.
> 
> Then you can write things like:
> 
> 	struct lru_rotate *lr = raw_cpu_ptr(&lru_rotate);
> 
> 	frob_lock(&lr->locks);
> 	frob_pvec(&lr->pvec);
> 	frob_unlock(&lr->locks);
> 
> and it all no longer matters if you got this or that CPU's instance.
> 
> After all, you no longer rely on per-cpu ness for serialization but the
> lock.

Thanks, understood. I'll look into it.
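
For reference, a minimal sketch of the single-deref pattern suggested
above, assuming the lock helpers are reworked to return the per-CPU
instance they locked (the lru_rotate_lock_irqsave() helper name is
hypothetical, not part of the posted patch):

static struct lru_rotate *lru_rotate_lock_irqsave(unsigned long *flags)
{
	struct lru_rotate *lr;

	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
		/*
		 * Dereference the per-CPU pointer exactly once; even if the
		 * task migrates afterwards, it keeps using the instance it
		 * picked, which the spinlock fully serializes.
		 */
		lr = raw_cpu_ptr(&lru_rotate);
		spin_lock_irqsave(&lr->locks.spin, *flags);
	} else {
		local_lock_irqsave(&lru_rotate.locks.local, *flags);
		lr = this_cpu_ptr(&lru_rotate);
	}
	return lr;
}

Callers would then operate on lr->pvec rather than re-deriving it with
this_cpu_ptr(&lru_rotate.pvec), and hand the same 'lr' back to the
matching unlock helper.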

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 18256e32a14c..5a2cef7cd394 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -32,6 +32,8 @@ 
 /* Do not use these with a slab allocator */
 #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
 
+extern struct static_key_false remote_pcpu_cache_access;
+
 void page_writeback_init(void);
 
 static inline void *folio_raw_mapping(struct folio *folio)
diff --git a/mm/swap.c b/mm/swap.c
index e7f9e4018ccf..bcf73bd563a6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,13 +46,27 @@ 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
+/*
+ * On some setups, like with nohz_full, CPUs might be too busy to handle
+ * per-cpu drain work, leading to unwarranted interruptions and hangs. This
+ * key, when enabled, allows for remote draining of these per-cpu caches/page
+ * lists at the cost of more constraining locking.
+ */
+__ro_after_init DEFINE_STATIC_KEY_FALSE(remote_pcpu_cache_access);
+
+struct lru_cache_locks {
+	local_lock_t local;
+	spinlock_t spin;
+};
+
 /* Protecting only lru_rotate.pvec which requires disabling interrupts */
 struct lru_rotate {
-	local_lock_t lock;
+	struct lru_cache_locks locks;
 	struct pagevec pvec;
 };
 static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
-	.lock = INIT_LOCAL_LOCK(lock),
+	.locks.local = INIT_LOCAL_LOCK(lru_rotate.locks.local),
+	.locks.spin = __SPIN_LOCK_UNLOCKED(lru_rotate.locks.spin),
 };
 
 /*
@@ -60,7 +74,7 @@  static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
  * by disabling preemption (and interrupts remain enabled).
  */
 struct lru_pvecs {
-	local_lock_t lock;
+	struct lru_cache_locks locks;
 	struct pagevec lru_add;
 	struct pagevec lru_deactivate_file;
 	struct pagevec lru_deactivate;
@@ -70,9 +84,94 @@  struct lru_pvecs {
 #endif
 };
 static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
-	.lock = INIT_LOCAL_LOCK(lock),
+	.locks.local = INIT_LOCAL_LOCK(lru_pvecs.locks.local),
+	.locks.spin = __SPIN_LOCK_UNLOCKED(lru_pvecs.locks.spin),
 };
 
+static inline void lru_cache_lock(struct lru_cache_locks *locks)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+		/* Avoid migration between this_cpu_ptr() and spin_lock() */
+		migrate_disable();
+		spin_lock(this_cpu_ptr(&locks->spin));
+	} else {
+		local_lock(&locks->local);
+	}
+}
+
+static inline void lru_cache_lock_irqsave(struct lru_cache_locks *locks,
+					  unsigned long *flagsp)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+		/* Avoid migration between this_cpu_ptr() and spin_lock_irqsave() */
+		migrate_disable();
+		spin_lock_irqsave(this_cpu_ptr(&locks->spin), *flagsp);
+	} else {
+		local_lock_irqsave(&locks->local, *flagsp);
+	}
+}
+
+/*
+ * The lru_cache_lock_cpu()/lru_cache_lock_irqsave_cpu() flavor of functions
+ * should only be used from remote CPUs when 'remote_pcpu_cache_access' is
+ * enabled or the target CPU is dead. Otherwise, it can still be called on the
+ * local CPU with migration disabled.
+ */
+static inline void lru_cache_lock_cpu(struct lru_cache_locks *locks, int cpu)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access))
+		spin_lock(per_cpu_ptr(&locks->spin, cpu));
+	else
+		local_lock(&locks->local);
+}
+
+static inline void lru_cache_lock_irqsave_cpu(struct lru_cache_locks *locks,
+					      unsigned long *flagsp, int cpu)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access))
+		spin_lock_irqsave(per_cpu_ptr(&locks->spin, cpu), *flagsp);
+	else
+		local_lock_irqsave(&locks->local, *flagsp);
+}
+
+static inline void lru_cache_unlock(struct lru_cache_locks *locks)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+		spin_unlock(this_cpu_ptr(&locks->spin));
+		migrate_enable();
+	} else {
+		local_unlock(&locks->local);
+	}
+}
+
+static inline void lru_cache_unlock_irqrestore(struct lru_cache_locks *locks,
+					       unsigned long flags)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access)) {
+		spin_unlock_irqrestore(this_cpu_ptr(&locks->spin), flags);
+		migrate_enable();
+	} else {
+		local_unlock_irqrestore(&locks->local, flags);
+	}
+}
+
+static inline void lru_cache_unlock_cpu(struct lru_cache_locks *locks, int cpu)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access))
+		spin_unlock(per_cpu_ptr(&locks->spin, cpu));
+	else
+		local_unlock(&locks->local);
+}
+
+static inline void lru_cache_unlock_irqrestore_cpu(struct lru_cache_locks *locks,
+						   unsigned long flags, int cpu)
+{
+	if (static_branch_unlikely(&remote_pcpu_cache_access))
+		spin_unlock_irqrestore(per_cpu_ptr(&locks->spin, cpu), flags);
+	else
+		local_unlock_irqrestore(&locks->local, flags);
+}
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs.  But it gets used by networking.
@@ -245,11 +344,11 @@  void folio_rotate_reclaimable(struct folio *folio)
 		unsigned long flags;
 
 		folio_get(folio);
-		local_lock_irqsave(&lru_rotate.lock, flags);
+		lru_cache_lock_irqsave(&lru_rotate.locks, &flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (pagevec_add_and_need_flush(pvec, &folio->page))
 			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
-		local_unlock_irqrestore(&lru_rotate.lock, flags);
+		lru_cache_unlock_irqrestore(&lru_rotate.locks, flags);
 	}
 }
 
@@ -341,11 +440,11 @@  static void folio_activate(struct folio *folio)
 		struct pagevec *pvec;
 
 		folio_get(folio);
-		local_lock(&lru_pvecs.lock);
+		lru_cache_lock(&lru_pvecs.locks);
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		if (pagevec_add_and_need_flush(pvec, &folio->page))
 			pagevec_lru_move_fn(pvec, __activate_page);
-		local_unlock(&lru_pvecs.lock);
+		lru_cache_unlock(&lru_pvecs.locks);
 	}
 }
 
@@ -372,7 +471,7 @@  static void __lru_cache_activate_folio(struct folio *folio)
 	struct pagevec *pvec;
 	int i;
 
-	local_lock(&lru_pvecs.lock);
+	lru_cache_lock(&lru_pvecs.locks);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
 
 	/*
@@ -394,7 +493,7 @@  static void __lru_cache_activate_folio(struct folio *folio)
 		}
 	}
 
-	local_unlock(&lru_pvecs.lock);
+	lru_cache_unlock(&lru_pvecs.locks);
 }
 
 /*
@@ -453,11 +552,11 @@  void folio_add_lru(struct folio *folio)
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
 	folio_get(folio);
-	local_lock(&lru_pvecs.lock);
+	lru_cache_lock(&lru_pvecs.locks);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
 	if (pagevec_add_and_need_flush(pvec, &folio->page))
 		__pagevec_lru_add(pvec);
-	local_unlock(&lru_pvecs.lock);
+	lru_cache_unlock(&lru_pvecs.locks);
 }
 EXPORT_SYMBOL(folio_add_lru);
 
@@ -592,8 +691,9 @@  static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 
 /*
  * Drain pages out of the cpu's pagevecs.
- * Either "cpu" is the current CPU, and preemption has already been
- * disabled; or "cpu" is being hot-unplugged, and is already dead.
+ * Either "cpu" is the current CPU, and preemption has already been disabled,
+ * or we're remotely flushing pvecs with the 'remote_pcpu_cache_access' key
+ * enabled, or "cpu" is being hot-unplugged and is already dead.
  */
 void lru_add_drain_cpu(int cpu)
 {
@@ -608,9 +708,9 @@  void lru_add_drain_cpu(int cpu)
 		unsigned long flags;
 
 		/* No harm done if a racing interrupt already did this */
-		local_lock_irqsave(&lru_rotate.lock, flags);
+		lru_cache_lock_irqsave_cpu(&lru_rotate.locks, &flags, cpu);
 		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
-		local_unlock_irqrestore(&lru_rotate.lock, flags);
+		lru_cache_unlock_irqrestore_cpu(&lru_rotate.locks, flags, cpu);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
@@ -649,12 +749,12 @@  void deactivate_file_page(struct page *page)
 	if (likely(get_page_unless_zero(page))) {
 		struct pagevec *pvec;
 
-		local_lock(&lru_pvecs.lock);
+		lru_cache_lock(&lru_pvecs.locks);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
-		local_unlock(&lru_pvecs.lock);
+		lru_cache_unlock(&lru_pvecs.locks);
 	}
 }
 
@@ -671,12 +771,12 @@  void deactivate_page(struct page *page)
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		struct pagevec *pvec;
 
-		local_lock(&lru_pvecs.lock);
+		lru_cache_lock(&lru_pvecs.locks);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
-		local_unlock(&lru_pvecs.lock);
+		lru_cache_unlock(&lru_pvecs.locks);
 	}
 }
 
@@ -693,28 +793,28 @@  void mark_page_lazyfree(struct page *page)
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		struct pagevec *pvec;
 
-		local_lock(&lru_pvecs.lock);
+		lru_cache_lock(&lru_pvecs.locks);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
-		local_unlock(&lru_pvecs.lock);
+		lru_cache_unlock(&lru_pvecs.locks);
 	}
 }
 
 void lru_add_drain(void)
 {
-	local_lock(&lru_pvecs.lock);
+	lru_cache_lock(&lru_pvecs.locks);
 	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&lru_pvecs.lock);
+	lru_cache_unlock(&lru_pvecs.locks);
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
-	local_lock(&lru_pvecs.lock);
+	lru_cache_lock(&lru_pvecs.locks);
 	lru_add_drain_cpu(smp_processor_id());
 	drain_local_pages(zone);
-	local_unlock(&lru_pvecs.lock);
+	lru_cache_unlock(&lru_pvecs.locks);
 }
 
 #ifdef CONFIG_SMP