mm/zsmalloc: Replace bit spinlock and get_cpu_var() usage.

Message ID 20210928084419.mkfu62barwrsvflq@linutronix.de (mailing list archive)
State New
Series mm/zsmalloc: Replace bit spinlock and get_cpu_var() usage.

Commit Message

Sebastian Andrzej Siewior Sept. 28, 2021, 8:44 a.m. UTC
From: Mike Galbraith <umgwanakikbuti@gmail.com>

For efficiency reasons, zsmalloc uses a slim `handle'. The value is
the address of a memory allocation of 4 or 8 bytes, depending on the
size of the long data type. The lowest bit in that allocated memory is
used as a bit spin lock.
The usage of the bit spin lock is problematic because, with the bit
spin lock held, zsmalloc acquires a rwlock_t and a spinlock_t, which
are both sleeping locks on PREEMPT_RT and therefore must not be
acquired with preemption disabled.

Extend the handle to a struct zsmalloc_handle which holds the old
handle as ->addr and a spinlock_t which replaces the bit spinlock.
Replace all the wrapper functions accordingly.
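
Condensed from the diff below, the PREEMPT_RT side of that change looks
roughly like this (the !PREEMPT_RT branches and error handling are
omitted; OBJ_TAG_BITS is defined elsewhere in mm/zsmalloc.c):

#include <linux/spinlock.h>

struct zsmalloc_handle {
	unsigned long addr;	/* the value the old slim handle stored */
	spinlock_t lock;	/* replaces the bit spin lock */
};

/* strip the tag bits to get back to the slab allocation */
static struct zsmalloc_handle *zs_get_pure_handle(unsigned long handle)
{
	return (void *)(handle & ~((1 << OBJ_TAG_BITS) - 1));
}

static void pin_tag(unsigned long handle)
{
	spin_lock(&zs_get_pure_handle(handle)->lock);
}

static void unpin_tag(unsigned long handle)
{
	spin_unlock(&zs_get_pure_handle(handle)->lock);
}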

The usage of get_cpu_var() in zs_map_object() is problematic because
it disables preemption and makes it impossible to acquire any sleeping
lock on PREEMPT_RT, such as a spinlock_t.
Replace the get_cpu_var() usage with a local_lock_t which is embedded
in struct mapping_area. It ensures that access to the struct is
serialized against all users on the same CPU.
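
The pattern change in zs_map_object()/zs_unmap_object() boils down to
the following (a generic sketch; pcpu_area, area_lock() and
area_unlock() are made-up names, the zsmalloc version is in the diff
below):

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct pcpu_area {
	local_lock_t lock;
	char *buf;
};

static DEFINE_PER_CPU(struct pcpu_area, pcpu_area) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static struct pcpu_area *area_lock(void)
{
	/* was: area = &get_cpu_var(pcpu_area);  -- disables preemption */
	local_lock(&pcpu_area.lock);
	return this_cpu_ptr(&pcpu_area);
}

static void area_unlock(void)
{
	/* was: put_cpu_var(pcpu_area); */
	local_unlock(&pcpu_area.lock);
}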

This survived LTP testing.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: replaced the bit_spin_lock() with a mutex, get_locked_var() and
 the patch description. Mike then fixed the size magic and made the
 handle lock a spinlock_t.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Minchan, this is the replacement we have so far. There is something
similar for block/zram. Disabling zsmalloc also disables zram, so
there is no fallout.
To the best of my knowledge, the only usage/testing is LTP based.

 mm/zsmalloc.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 78 insertions(+), 6 deletions(-)

Comments

Andrew Morton Sept. 28, 2021, 10:47 p.m. UTC | #1
On Tue, 28 Sep 2021 10:44:19 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> From: Mike Galbraith <umgwanakikbuti@gmail.com>
> 
> For efficiency reasons, zsmalloc uses a slim `handle'. The value is
> the address of a memory allocation of 4 or 8 bytes, depending on the
> size of the long data type. The lowest bit in that allocated memory is
> used as a bit spin lock.
> The usage of the bit spin lock is problematic because, with the bit
> spin lock held, zsmalloc acquires a rwlock_t and a spinlock_t, which
> are both sleeping locks on PREEMPT_RT and therefore must not be
> acquired with preemption disabled.
> 
> Extend the handle to a struct zsmalloc_handle which holds the old
> handle as ->addr and a spinlock_t which replaces the bit spinlock.
> Replace all the wrapper functions accordingly.
> 
> The usage of get_cpu_var() in zs_map_object() is problematic because
> it disables preemption and makes it impossible to acquire any sleeping
> lock on PREEMPT_RT, such as a spinlock_t.
> Replace the get_cpu_var() usage with a local_lock_t which is embedded
> in struct mapping_area. It ensures that access to the struct is
> serialized against all users on the same CPU.
> 
> This survived LTP testing.

Rather nasty with all the ifdefs and two different locking approaches
to be tested.  What would be the impact of simply switching to the new
scheme for all configs?

Which is identical to asking "what is the impact of switching to the new
scheme for PREEMPT_RT"!  Which is I think an important thing for the
changelog to address?
Mike Galbraith Sept. 29, 2021, 2:11 a.m. UTC | #2
On Tue, 2021-09-28 at 15:47 -0700, Andrew Morton wrote:
> On Tue, 28 Sep 2021 10:44:19 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> > From: Mike Galbraith <umgwanakikbuti@gmail.com>
> >
> > For efficiency reasons, zsmalloc uses a slim `handle'. The value is
> > the address of a memory allocation of 4 or 8 bytes, depending on the
> > size of the long data type. The lowest bit in that allocated memory is
> > used as a bit spin lock.
> > The usage of the bit spin lock is problematic because, with the bit
> > spin lock held, zsmalloc acquires a rwlock_t and a spinlock_t, which
> > are both sleeping locks on PREEMPT_RT and therefore must not be
> > acquired with preemption disabled.
> >
> > Extend the handle to a struct zsmalloc_handle which holds the old
> > handle as ->addr and a spinlock_t which replaces the bit spinlock.
> > Replace all the wrapper functions accordingly.
> >
> > The usage of get_cpu_var() in zs_map_object() is problematic because
> > it disables preemption and makes it impossible to acquire any sleeping
> > lock on PREEMPT_RT, such as a spinlock_t.
> > Replace the get_cpu_var() usage with a local_lock_t which is embedded
> > in struct mapping_area. It ensures that access to the struct is
> > serialized against all users on the same CPU.
> >
> > This survived LTP testing.
>
> Rather nasty with all the ifdefs and two different locking approaches
> to be tested.  What would be the impact of simply switching to the new
> scheme for all configs?
>
> Which is identical to asking "what is the impact of switching to the new
> scheme for PREEMPT_RT"!  Which is I think an important thing for the
> changelog to address?

Good questions both, for which I have no answers.  The problematic bit
spinlock going away would certainly be preferred if deemed acceptable.
I frankly doubt either it or zram would be missed were they to instead
be disabled for RT configs.  I certainly have no use for it; making it
functional was a simple case of boy meets bug and annoys it to death.

	-Mike
Sebastian Andrzej Siewior Sept. 29, 2021, 7:23 a.m. UTC | #3
On 2021-09-28 15:47:23 [-0700], Andrew Morton wrote:
> Rather nasty with all the ifdefs and two different locking approaches
> to be tested.  What would be the impact of simply switching to the new
> scheme for all configs?

The current scheme uses the lower bit (OBJ_ALLOCATED_TAG), which is
guaranteed to be zero due to memory alignment requirements, for
something special. The content of that memory, the long, is then used
as a bit spinlock.
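
To illustrate the alignment argument (illustrative helpers only, not
the zsmalloc code; the corresponding macros in mm/zsmalloc.c are
OBJ_ALLOCATED_TAG and OBJ_TAG_BITS):

#include <linux/types.h>

#define TAG_BITS	1
#define ALLOCATED_TAG	1UL

/* the low bit of a word-aligned address is 0, so it can carry a marker */
static inline unsigned long mark_allocated(unsigned long handle)
{
	return handle | ALLOCATED_TAG;
}

static inline bool is_allocated(unsigned long head)
{
	return head & ALLOCATED_TAG;
}

static inline unsigned long strip_tag(unsigned long head)
{
	return head & ~((1UL << TAG_BITS) - 1);
}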

Moving it to a spinlock_t would consume only 4 bytes of memory, assuming
lockdep is off. That is 4 bytes less than a long on 64-bit archs.
So we could do this if nobody disagrees. The spinlock_t has clear
advantages over a bit spinlock, such as the "order" (fairness) of the
qspinlock implementation. But then I have no idea what the contention
here is.
With lockdep enabled the struct gets a little bigger, which I assume was
to be avoided. But then only debug builds are affected so…
 
> Which is identical to asking "what is the impact of switching to the new
> scheme for PREEMPT_RT"!  Which is I think an important thing for the
> changelog to address?

Well, PREEMPT_RT can't work with the bit spinlock in it. That is this
part of the changelog:
| The usage of the bit spin lock is problematic because with the bit spin
| lock held zsmalloc acquires a rwlock_t and spinlock_t which are both
| sleeping locks on PREEMPT_RT and therefore must not be acquired with
| disabled preemption.


Sebastian
Minchan Kim Sept. 29, 2021, 7:09 p.m. UTC | #4
On Wed, Sep 29, 2021 at 09:23:59AM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-28 15:47:23 [-0700], Andrew Morton wrote:
> > Rather nasty with all the ifdefs and two different locking approaches
> > to be tested.  What would be the impact of simply switching to the new
> > scheme for all configs?
> 
> The current scheme uses the lower bit (OBJ_ALLOCATED_TAG), which is
> guaranteed to be zero due to memory alignment requirements, for
> something special. The content of that memory, the long, is then used
> as a bit spinlock.
> 
> Moving it to a spinlock_t would consume only 4 bytes of memory, assuming
> lockdep is off. That is 4 bytes less than a long on 64-bit archs.
> So we could do this if nobody disagrees. The spinlock_t has clear
> advantages over a bit spinlock, such as the "order" (fairness) of the
> qspinlock implementation. But then I have no idea what the contention
> here is.
> With lockdep enabled the struct gets a little bigger, which I assume was
> to be avoided. But then only debug builds are affected so…

First of all, thanks for the patch, Sebastian.

The zsmalloc is usually used with swap, and swap sizes are normally several
GB or more. Thus, adding a per-page spinlock is rather expensive, so I'd
like to consider this approach as a last resort. As for the lock contention,
it's rare, so a spinlock wouldn't help much.

Let me try changing the bit lock into a sleepable lock on PREEMPT_RT with
bigger granularity.
Sebastian Andrzej Siewior Sept. 30, 2021, 6:42 a.m. UTC | #5
On 2021-09-29 12:09:05 [-0700], Minchan Kim wrote:
> First of all, thanks for the patch, Sebastian.
> 
> The zsmalloc is usually used with swap, and swap sizes are normally several
> GB or more. Thus, adding a per-page spinlock is rather expensive, so I'd
> like to consider this approach as a last resort. As for the lock contention,
> it's rare, so a spinlock wouldn't help much.
> 
> Let me try changing the bit lock into a sleepable lock on PREEMPT_RT with
> bigger granularity.

Okay, thank you. spinlock_t is always four bytes without lockdep, so two
of them fit in the space of one long on 64-bit ;)

Sebastian

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 68e8831068f4b..22c18ac605b5f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -57,6 +57,7 @@ 
 #include <linux/wait.h>
 #include <linux/pagemap.h>
 #include <linux/fs.h>
+#include <linux/local_lock.h>
 
 #define ZSPAGE_MAGIC	0x58
 
@@ -77,6 +78,20 @@ 
 
 #define ZS_HANDLE_SIZE (sizeof(unsigned long))
 
+#ifdef CONFIG_PREEMPT_RT
+
+struct zsmalloc_handle {
+	unsigned long addr;
+	spinlock_t lock;
+};
+
+#define ZS_HANDLE_ALLOC_SIZE (sizeof(struct zsmalloc_handle))
+
+#else
+
+#define ZS_HANDLE_ALLOC_SIZE (sizeof(unsigned long))
+#endif
+
 /*
  * Object location (<PFN>, <obj_idx>) is encoded as
  * a single (unsigned long) handle value.
@@ -293,6 +308,7 @@  struct zspage {
 };
 
 struct mapping_area {
+	local_lock_t lock;
 	char *vm_buf; /* copy buffer for objects that span pages */
 	char *vm_addr; /* address of kmap_atomic()'ed pages */
 	enum zs_mapmode vm_mm; /* mapping mode */
@@ -322,7 +338,7 @@  static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
 
 static int create_cache(struct zs_pool *pool)
 {
-	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
+	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_ALLOC_SIZE,
 					0, 0, NULL);
 	if (!pool->handle_cachep)
 		return 1;
@@ -346,10 +362,27 @@  static void destroy_cache(struct zs_pool *pool)
 
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
-	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+	void *p;
+
+	p = kmem_cache_alloc(pool->handle_cachep,
+			     gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+#ifdef CONFIG_PREEMPT_RT
+	if (p) {
+		struct zsmalloc_handle *zh = p;
+
+		spin_lock_init(&zh->lock);
+	}
+#endif
+	return (unsigned long)p;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static struct zsmalloc_handle *zs_get_pure_handle(unsigned long handle)
+{
+	return (void *)(handle & ~((1 << OBJ_TAG_BITS) - 1));
+}
+#endif
+
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
 {
 	kmem_cache_free(pool->handle_cachep, (void *)handle);
@@ -368,12 +401,18 @@  static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
 
 static void record_obj(unsigned long handle, unsigned long obj)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	WRITE_ONCE(zh->addr, obj);
+#else
 	/*
 	 * lsb of @obj represents handle lock while other bits
 	 * represent object value the handle is pointing so
 	 * updating shouldn't do store tearing.
 	 */
 	WRITE_ONCE(*(unsigned long *)handle, obj);
+#endif
 }
 
 /* zpool driver */
@@ -455,7 +494,9 @@  MODULE_ALIAS("zpool-zsmalloc");
 #endif /* CONFIG_ZPOOL */
 
 /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
+static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = {
+	.lock	= INIT_LOCAL_LOCK(lock),
+};
 
 static bool is_zspage_isolated(struct zspage *zspage)
 {
@@ -862,7 +903,13 @@  static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
 
 static unsigned long handle_to_obj(unsigned long handle)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	return zh->addr;
+#else
 	return *(unsigned long *)handle;
+#endif
 }
 
 static unsigned long obj_to_head(struct page *page, void *obj)
@@ -876,22 +923,46 @@  static unsigned long obj_to_head(struct page *page, void *obj)
 
 static inline int testpin_tag(unsigned long handle)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	return spin_is_locked(&zh->lock);
+#else
 	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+#endif
 }
 
 static inline int trypin_tag(unsigned long handle)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	return spin_trylock(&zh->lock);
+#else
 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
+#endif
 }
 
 static void pin_tag(unsigned long handle) __acquires(bitlock)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	return spin_lock(&zh->lock);
+#else
 	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
+#endif
 }
 
 static void unpin_tag(unsigned long handle) __releases(bitlock)
 {
+#ifdef CONFIG_PREEMPT_RT
+	struct zsmalloc_handle *zh = zs_get_pure_handle(handle);
+
+	return spin_unlock(&zh->lock);
+#else
 	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
+#endif
 }
 
 static void reset_page(struct page *page)
@@ -1274,7 +1345,8 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	class = pool->size_class[class_idx];
 	off = (class->size * obj_idx) & ~PAGE_MASK;
 
-	area = &get_cpu_var(zs_map_area);
+	local_lock(&zs_map_area.lock);
+	area = this_cpu_ptr(&zs_map_area);
 	area->vm_mm = mm;
 	if (off + class->size <= PAGE_SIZE) {
 		/* this object is contained entirely within a page */
@@ -1328,7 +1400,7 @@  void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 
 		__zs_unmap_object(area, pages, off, class->size);
 	}
-	put_cpu_var(zs_map_area);
+	local_unlock(&zs_map_area.lock);
 
 	migrate_read_unlock(zspage);
 	unpin_tag(handle);