
[1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()

Message ID 20240226-zsmalloc-zspage-rcu-v1-1-456b0ef1a89d@bytedance.com (mailing list archive)
State New
Series mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()

Commit Message

Chengming Zhou Feb. 27, 2024, 3:02 a.m. UTC
free_zspage() has to hold the locks of all pages, since the zs_page_migrate()
path relies on these page locks to protect the race between zs_free() and
itself, so that it can safely get the zspage from page->private.

But this approach is neither good nor simple enough:

1. Since zs_free() can't sleep, it can only trylock the pages, or has to
   kick_deferred_free() to defer the freeing to a worker.

2. Even in the worker context, async_free_zspage() can't simply lock all
   pages in lock_zspage(); it still has to trylock because of the race
   between zs_free() and zs_page_migrate(). See commit 2505a981114d
   ("zsmalloc: fix races between asynchronous zspage free and page
   migration") for details.

Actually, all free_zspage() needs is to get the zspage from the page safely,
which we can achieve easily with RCU. Then free_zspage() doesn't need to
hold the locks of all pages, so the deferred free mechanism isn't needed
at all.
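
Concretely, the zspage pointer stored in page->private is published with
rcu_assign_pointer() and read locklessly with rcu_dereference(); a minimal
sketch of the two helpers this patch adds (see the full hunk below):

static void set_zspage(struct page *page, struct zspage *zspage)
{
	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;

	/* Paired with rcu_dereference() in get_zspage_lockless(). */
	rcu_assign_pointer(*private, zspage);
}

/* Lockless read side, only used by zs_page_migrate(). */
static struct zspage *get_zspage_lockless(struct page *page)
{
	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;

	return rcu_dereference(*private);
}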

The updated zs_page_migrate() now has two more cases to consider:

1. get_zspage_lockless() returns NULL: free_zspage() has already called
   reset_page() on this page and dropped its reference to the page.

2. get_zspage_lockless() returns a zspage, but it's not on the pool list:
   the zspage has been removed from the list and is in the process of being
   freed.

I'm not sure what value should be returned in these cases: -EINVAL, -EAGAIN,
or something else? If the migration caller can see that the page has no
extra references and can just free it, I think we should return -EAGAIN so
that the migration caller retries this page later and frees it. For now I
chose -EINVAL to skip migration of this page; it doesn't seem like a big
deal to fail migration of some pages.
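
For reference, a rough sketch of how the updated zs_page_migrate() handles
the two cases above (this just mirrors the hunk at the end of this patch):

	rcu_read_lock();
	zspage = get_zspage_lockless(page);
	if (!zspage) {
		/* Case 1: free_zspage() already did reset_page() on this page. */
		rcu_read_unlock();
		return -EINVAL;
	}

	pool = zspage->pool;
	spin_lock(&pool->lock);
	if (list_empty(&zspage->list)) {
		/* Case 2: removed from the pool list, the free is in flight. */
		spin_unlock(&pool->lock);
		rcu_read_unlock();
		return -EINVAL;
	}
	/* Still on the pool list and pool->lock is held, so it can't be freed now. */
	rcu_read_unlock();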

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zsmalloc.c | 97 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 56 insertions(+), 41 deletions(-)

Comments

Sergey Senozhatsky Feb. 28, 2024, 4:33 a.m. UTC | #1
On (24/02/27 03:02), Chengming Zhou wrote:
[..]
> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  		pages[i] = page;
>  	}
>  
> -	create_page_chain(class, zspage, pages);
>  	init_zspage(class, zspage);
>  	zspage->pool = pool;
>  	zspage->class = class->index;
> +	/* RCU set_zspage() after zspage initialized. */
> +	create_page_chain(class, zspage, pages);

So this hasn't been tested, has it?

init_zspage() does not like to be invoked before create_page_chain(),
because we haven't set up the required pointers yet.

So when init_zspage() calls get_first_page() it gets NULL zspage->first_page
which we then use in is_first_page(first_page)->PagePrivate(page). As far as
I can tell.
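
For reference, the helpers involved look roughly like this (paraphrased
from mm/zsmalloc.c, not part of this patch), which is why zspage->first_page
must be set by create_page_chain() before init_zspage() runs:

static int is_first_page(struct page *page)
{
	return PagePrivate(page);
}

static struct page *get_first_page(struct zspage *zspage)
{
	/* Still NULL if create_page_chain() hasn't run yet. */
	struct page *first_page = zspage->first_page;

	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
	return first_page;
}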
Chengming Zhou Feb. 28, 2024, 5:14 a.m. UTC | #2
On 2024/2/28 12:33, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
> [..]
>> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>  		pages[i] = page;
>>  	}
>>  
>> -	create_page_chain(class, zspage, pages);
>>  	init_zspage(class, zspage);
>>  	zspage->pool = pool;
>>  	zspage->class = class->index;
>> +	/* RCU set_zspage() after zspage initialized. */
>> +	create_page_chain(class, zspage, pages);
> 
> So this hasn't been tested, has it?
I have tested it in my test VM, but it didn't have KASAN enabled. I tested
a kernel build in tmpfs with zswap enabled using the zsmalloc pool, and I'm
not sure why the kernel didn't crash then...

> 
> init_zspage() does not like to be invoked before create_page_chain(),
> because we haven't set up the required pointers yet.

You're right, I can reproduce the problem with KASAN enabled this time.
create_page_chain() should be placed before init_zspage(), which iterates
over the pages to create the free-objects list.
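
Something like this ordering would be needed (a rough sketch, not final,
since putting create_page_chain() first also moves the RCU publish of
page->private back before the zspage is fully initialized, which I still
need to think about):

	/* first_page and the page chain must exist before init_zspage(). */
	create_page_chain(class, zspage, pages);
	init_zspage(class, zspage);
	zspage->pool = pool;
	zspage->class = class->index;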

> 
> So when init_zspage() calls get_first_page() it gets NULL zspage->first_page
> which we then use in is_first_page(first_page)->PagePrivate(page). As far as
> I can tell.

Thanks! I will fix it and test thoroughly before sending an update.
Sergey Senozhatsky Feb. 28, 2024, 5:29 a.m. UTC | #3
On (24/02/28 13:14), Chengming Zhou wrote:
> On 2024/2/28 12:33, Sergey Senozhatsky wrote:
> > On (24/02/27 03:02), Chengming Zhou wrote:
> > [..]
> >> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> >>  		pages[i] = page;
> >>  	}
> >>  
> >> -	create_page_chain(class, zspage, pages);
> >>  	init_zspage(class, zspage);
> >>  	zspage->pool = pool;
> >>  	zspage->class = class->index;
> >> +	/* RCU set_zspage() after zspage initialized. */
> >> +	create_page_chain(class, zspage, pages);
> > 
> > So this hasn't been tested, has it?
> I have tested it in my test VM, but it didn't have KASAN enabled. I tested
> a kernel build in tmpfs with zswap enabled using the zsmalloc pool, and I'm
> not sure why the kernel didn't crash then...

I hit the problem on a non-KASAN-enabled kernel.  KASAN was enabled
later on.

[..]

> > So when init_zspage() calls get_first_page() it gets NULL zspage->first_page
> > which we then use in is_first_page(first_page)->PagePrivate(page). As far as
> > I can tell.
>
> Thanks! I will fix it and test thoroughly before sending an update.

I'm curious if we want to add RCU to the picture, given that zsmalloc
is quite often run under memory pressure.
Chengming Zhou Feb. 28, 2024, 5:42 a.m. UTC | #4
On 2024/2/28 13:29, Sergey Senozhatsky wrote:
> On (24/02/28 13:14), Chengming Zhou wrote:
>> On 2024/2/28 12:33, Sergey Senozhatsky wrote:
>>> On (24/02/27 03:02), Chengming Zhou wrote:
>>> [..]
>>>> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>>>  		pages[i] = page;
>>>>  	}
>>>>  
>>>> -	create_page_chain(class, zspage, pages);
>>>>  	init_zspage(class, zspage);
>>>>  	zspage->pool = pool;
>>>>  	zspage->class = class->index;
>>>> +	/* RCU set_zspage() after zspage initialized. */
>>>> +	create_page_chain(class, zspage, pages);
>>>
>>> So this hasn't been tested, has it?
>> I have tested it in my test VM, but it didn't have KASAN enabled. I tested
>> a kernel build in tmpfs with zswap enabled using the zsmalloc pool, and I'm
>> not sure why the kernel didn't crash then...
> 
> I hit the problem on non-kasan-enabled kernel.  KASAN was enabled
> later on.

OK, it was probably a problem with my testing process, sorry.

> 
> [..]
> 
>>> So when init_zspage() calls get_first_page() it gets NULL zspage->first_page
>>> which we then use in is_first_page(first_page)->PagePrivate(page). As far as
>>> I can tell.
>>
>> Thanks! I will fix it and test thoroughly before sending an update.
> 
> I'm curious if we want to add RCU to the picture, given that zsmalloc
> is quite often run under memory pressure.

Yes, that's a reasonable point. But given that struct zspage is only 56
bytes, maybe it's not a problem to delay its freeing to RCU?

Thanks.
Sergey Senozhatsky Feb. 28, 2024, 6:01 a.m. UTC | #5
On (24/02/27 03:02), Chengming Zhou wrote:
>  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  				struct zspage *zspage)
>  {
> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);

Who owns the page lock here if free_zspage() no longer does trylock_zspage()?

>  		next = get_next_page(page);
>  		reset_page(page);
> -		unlock_page(page);
>  		dec_zone_page_state(page, NR_ZSPAGES);
>  		put_page(page);
>  		page = next;
>  	} while (page != NULL);
>  
> -	cache_free_zspage(pool, zspage);
> +	call_rcu(&zspage->rcu_head, rcu_free_zspage);
>  
>  	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
>  	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
> @@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
>  	VM_BUG_ON(get_zspage_inuse(zspage));
>  	VM_BUG_ON(list_empty(&zspage->list));
>  
> -	/*
> -	 * Since zs_free couldn't be sleepable, this function cannot call
> -	 * lock_page. The page locks trylock_zspage got will be released
> -	 * by __free_zspage.
> -	 */
> -	if (!trylock_zspage(zspage)) {
> -		kick_deferred_free(pool);
> -		return;
> -	}
> -
>  	remove_zspage(class, zspage);
>  	__free_zspage(pool, class, zspage);
>  }
Sergey Senozhatsky Feb. 28, 2024, 6:07 a.m. UTC | #6
On (24/02/28 13:42), Chengming Zhou wrote:
> > I'm curious if we want to add RCU to the picture, given that zsmalloc
> > is quite often run under memory pressure.
> 
> Yes, that's a reasonable point. But given that struct zspage is only 56
> bytes, maybe it's not a problem to delay its freeing to RCU?

Hmm, yeah, probably.  I think it'll make sense to wait for more
"go for it" from Cc-ed folks before we land this series.
Sergey Senozhatsky Feb. 28, 2024, 6:14 a.m. UTC | #7
On (24/02/27 03:02), Chengming Zhou wrote:
> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>  		next = get_next_page(page);
>  		reset_page(page);

reset_page()->__ClearPageMovable()->PageMovable() expects page to be
locked.

> -		unlock_page(page);
>  		dec_zone_page_state(page, NR_ZSPAGES);
>  		put_page(page);
>  		page = next;
>  	} while (page != NULL);
Chengming Zhou Feb. 28, 2024, 6:44 a.m. UTC | #8
On 2024/2/28 14:01, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
>>  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  				struct zspage *zspage)
>>  {
>> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
> 
> Who owns the page lock here if free_zspage() no longer does trylock_zspage()?

Right, it should be removed.

> 
>>  		next = get_next_page(page);
>>  		reset_page(page);
>> -		unlock_page(page);
>>  		dec_zone_page_state(page, NR_ZSPAGES);
>>  		put_page(page);
>>  		page = next;
>>  	} while (page != NULL);
>>  
>> -	cache_free_zspage(pool, zspage);
>> +	call_rcu(&zspage->rcu_head, rcu_free_zspage);
>>  
>>  	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
>>  	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
>> @@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
>>  	VM_BUG_ON(get_zspage_inuse(zspage));
>>  	VM_BUG_ON(list_empty(&zspage->list));
>>  
>> -	/*
>> -	 * Since zs_free couldn't be sleepable, this function cannot call
>> -	 * lock_page. The page locks trylock_zspage got will be released
>> -	 * by __free_zspage.
>> -	 */
>> -	if (!trylock_zspage(zspage)) {
>> -		kick_deferred_free(pool);
>> -		return;
>> -	}
>> -
>>  	remove_zspage(class, zspage);
>>  	__free_zspage(pool, class, zspage);
>>  }
Chengming Zhou Feb. 28, 2024, 6:49 a.m. UTC | #9
On 2024/2/28 14:14, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
>> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>>  		next = get_next_page(page);
>>  		reset_page(page);
> 
> reset_page()->__ClearPageMovable()->PageMovable() expects page to be
> locked.

This seems to make the patch not work anymore... I will think about it.

Thanks!

> 
>> -		unlock_page(page);
>>  		dec_zone_page_state(page, NR_ZSPAGES);
>>  		put_page(page);
>>  		page = next;
>>  	} while (page != NULL);

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63ec385cd670..b153f2e5fc0f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -253,6 +253,7 @@  struct zspage {
 	struct list_head list; /* fullness list */
 	struct zs_pool *pool;
 	rwlock_t lock;
+	struct rcu_head rcu_head;
 };
 
 struct mapping_area {
@@ -310,6 +311,8 @@  static int create_cache(struct zs_pool *pool)
 static void destroy_cache(struct zs_pool *pool)
 {
 	kmem_cache_destroy(pool->handle_cachep);
+	/* Synchronize RCU to free zspage. */
+	synchronize_rcu();
 	kmem_cache_destroy(pool->zspage_cachep);
 }
 
@@ -335,6 +338,14 @@  static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
 	kmem_cache_free(pool->zspage_cachep, zspage);
 }
 
+static void rcu_free_zspage(struct rcu_head *h)
+{
+	struct zspage *zspage = container_of(h, struct zspage, rcu_head);
+	struct zs_pool *pool = zspage->pool;
+
+	kmem_cache_free(pool->zspage_cachep, zspage);
+}
+
 /* pool->lock(which owns the handle) synchronizes races */
 static void record_obj(unsigned long handle, unsigned long obj)
 {
@@ -710,14 +721,31 @@  static int fix_fullness_group(struct size_class *class, struct zspage *zspage)
 	return newfg;
 }
 
+static void set_zspage(struct page *page, struct zspage *zspage)
+{
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+
+	rcu_assign_pointer(*private, zspage);
+}
+
 static struct zspage *get_zspage(struct page *page)
 {
-	struct zspage *zspage = (struct zspage *)page_private(page);
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+	struct zspage *zspage;
 
+	zspage = rcu_dereference_protected(*private, true);
 	BUG_ON(zspage->magic != ZSPAGE_MAGIC);
 	return zspage;
 }
 
+/* Only used in zs_page_migrate() to get zspage locklessly. */
+static struct zspage *get_zspage_lockless(struct page *page)
+{
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+
+	return rcu_dereference(*private);
+}
+
 static struct page *get_next_page(struct page *page)
 {
 	struct zspage *zspage = get_zspage(page);
@@ -793,32 +821,11 @@  static void reset_page(struct page *page)
 {
 	__ClearPageMovable(page);
 	ClearPagePrivate(page);
-	set_page_private(page, 0);
+	set_zspage(page, NULL);
 	page_mapcount_reset(page);
 	page->index = 0;
 }
 
-static int trylock_zspage(struct zspage *zspage)
-{
-	struct page *cursor, *fail;
-
-	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
-					get_next_page(cursor)) {
-		if (!trylock_page(cursor)) {
-			fail = cursor;
-			goto unlock;
-		}
-	}
-
-	return 1;
-unlock:
-	for (cursor = get_first_page(zspage); cursor != fail; cursor =
-					get_next_page(cursor))
-		unlock_page(cursor);
-
-	return 0;
-}
-
 static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 				struct zspage *zspage)
 {
@@ -834,13 +841,12 @@  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		next = get_next_page(page);
 		reset_page(page);
-		unlock_page(page);
 		dec_zone_page_state(page, NR_ZSPAGES);
 		put_page(page);
 		page = next;
 	} while (page != NULL);
 
-	cache_free_zspage(pool, zspage);
+	call_rcu(&zspage->rcu_head, rcu_free_zspage);
 
 	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
 	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
@@ -852,16 +858,6 @@  static void free_zspage(struct zs_pool *pool, struct size_class *class,
 	VM_BUG_ON(get_zspage_inuse(zspage));
 	VM_BUG_ON(list_empty(&zspage->list));
 
-	/*
-	 * Since zs_free couldn't be sleepable, this function cannot call
-	 * lock_page. The page locks trylock_zspage got will be released
-	 * by __free_zspage.
-	 */
-	if (!trylock_zspage(zspage)) {
-		kick_deferred_free(pool);
-		return;
-	}
-
 	remove_zspage(class, zspage);
 	__free_zspage(pool, class, zspage);
 }
@@ -929,7 +925,7 @@  static void create_page_chain(struct size_class *class, struct zspage *zspage,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
-		set_page_private(page, (unsigned long)zspage);
+		set_zspage(page, zspage);
 		page->index = 0;
 		if (i == 0) {
 			zspage->first_page = page;
@@ -978,10 +974,11 @@  static struct zspage *alloc_zspage(struct zs_pool *pool,
 		pages[i] = page;
 	}
 
-	create_page_chain(class, zspage, pages);
 	init_zspage(class, zspage);
 	zspage->pool = pool;
 	zspage->class = class->index;
+	/* RCU set_zspage() after zspage initialized. */
+	create_page_chain(class, zspage, pages);
 
 	return zspage;
 }
@@ -1765,17 +1762,35 @@  static int zs_page_migrate(struct page *newpage, struct page *page,
 
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	/* The page is locked, so this pointer must remain valid */
-	zspage = get_zspage(page);
-	pool = zspage->pool;
+	rcu_read_lock();
+	zspage = get_zspage_lockless(page);
+	if (!zspage) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
 
 	/*
 	 * The pool's lock protects the race between zpage migration
-	 * and zs_free.
+	 * and zs_free. We check if the zspage is still in pool with
+	 * pool->lock protection. If the zspage isn't in pool anymore,
+	 * it should be freed by RCU soon.
 	 */
+	pool = zspage->pool;
 	spin_lock(&pool->lock);
 	class = zspage_class(pool, zspage);
 
+	if (list_empty(&zspage->list)) {
+		spin_unlock(&pool->lock);
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	/*
+	 * Now the zspage is still on pool, and we held pool->lock,
+	 * it can't be freed in the meantime.
+	 */
+	rcu_read_unlock();
+
 	/* the migrate_write_lock protects zpage access via zs_map_object */
 	migrate_write_lock(zspage);