[v4] mm: add zblock allocator

Message ID 20250412154207.2152667-1-vitaly.wool@konsulko.se (mailing list archive)
State New
Series [v4] mm: add zblock allocator

Commit Message

Vitaly Wool April 12, 2025, 3:42 p.m. UTC
zblock is a special-purpose allocator for storing compressed pages.
It stores an integer number of same-size objects per block. These
blocks consist of several physical pages (2**n, i.e. 1/2/4/8).

With zblock, it is possible to densely arrange objects of various sizes,
resulting in low internal fragmentation. The allocator also tries to
fill incomplete blocks instead of adding new ones, in many cases
providing a compression ratio comparable to zsmalloc's.

In most cases zblock is also superior to zsmalloc with regard to
average performance and worst-case execution times, thus allowing for
better response times and real-time characteristics of the whole system.

High memory and page migration are currently not supported by zblock.
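
As a back-of-the-envelope illustration (not part of the patch text), the
per-type slot size follows from the SLOT_SIZE() macro in mm/zblock.h below.
Assuming a typical 64-bit kernel with 4 KiB pages, where struct zblock_block
rounds up to a 40-byte header, the first 4K block type (63 slots, order 0)
works out to:

    BLOCK_DATA_SIZE(0) = (PAGE_SIZE << 0) - ZBLOCK_HEADER_SIZE
                       = 4096 - 40 = 4056 bytes
    SLOT_SIZE(63, 0)   = round_down(4056 / 63, sizeof(long))
                       = round_down(64, 8) = 64 bytes

i.e. a single order-0 page holds 63 objects of up to 64 bytes each, and
larger objects fall through to the next entry of the block_desc[] table.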

Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Igor Belousov <igor.b@beldev.am>
---
v3: https://patchwork.kernel.org/project/linux-mm/patch/20250408125211.1611879-1-vitaly.wool@konsulko.se/
Changes since v3:
- rebased and tested against the latest -mm tree
- the block descriptors table was updated for better compression ratio
- fixed the bug with wrong SLOT_BITS value
- slot search moved to find_and_claim_block()

Test results (zstd compressor, 8-core Ryzen 9 VM, make bzImage):
- zblock:
    real	6m52.621s
    user	33m41.771s
    sys 	6m28.825s
    Zswap:            162328 kB
    Zswapped:         754468 kB
    zswpin 93851
    zswpout 542481
    zswpwb 935
- zsmalloc:
    real	7m4.355s
    user	34m37.538s
    sys 	6m22.086s
    zswpin 101243
    zswpout 448217
    zswpwb 640
    Zswap:            175704 kB
    Zswapped:         778692 kB

 Documentation/mm/zblock.rst |  24 ++
 MAINTAINERS                 |   7 +
 mm/Kconfig                  |  12 +
 mm/Makefile                 |   1 +
 mm/zblock.c                 | 443 ++++++++++++++++++++++++++++++++++++
 mm/zblock.h                 | 176 ++++++++++++++
 6 files changed, 663 insertions(+)
 create mode 100644 Documentation/mm/zblock.rst
 create mode 100644 mm/zblock.c
 create mode 100644 mm/zblock.h
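
For orientation while reading mm/zblock.c below, the handle returned by
zblock_alloc() can be pictured as follows. This is an illustration only,
derived from metadata_to_handle()/handle_to_metadata(), assuming a 64-bit
kernel with 4 KiB pages (SLOT_BITS == 7):

    bits 63..12   address of the zblock_block (page-aligned, since blocks
                  come from __get_free_pages(), so its low 12 bits are
                  zero and free to carry metadata)
    bits 11..7    block type, i.e. index into the block_desc[] table
    bits  6..0    slot index within the block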

Comments

Igor Belousov April 12, 2025, 7:25 p.m. UTC | #1
On 2025-04-12 19:42, Vitaly Wool wrote:
> zblock is a special purpose allocator for storing compressed pages.
> It stores integer number of same size objects per its block. These
> blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> 
> With zblock, it is possible to densely arrange objects of various sizes
> resulting in low internal fragmentation. Also this allocator tries to
> fill incomplete blocks instead of adding new ones, in many cases
> providing a compression ratio comparable to zsmalloc's.
> 
> zblock is also in most cases superior to zsmalloc with regard to
> average performance and worst execution times, thus allowing for better
> response time and real-time characteristics of the whole system.
> 
> High memory and page migration are currently not supported by zblock.
> 
> Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
> Signed-off-by: Igor Belousov <igor.b@beldev.am>

Tested-by: Igor Belousov <igor.b@beldev.am>

Johannes Weiner April 16, 2025, 12:09 p.m. UTC | #2
On Sat, Apr 12, 2025 at 05:42:07PM +0200, Vitaly Wool wrote:
> zblock is a special purpose allocator for storing compressed pages.
> It stores integer number of same size objects per its block. These
> blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> 
> With zblock, it is possible to densely arrange objects of various sizes
> resulting in low internal fragmentation. Also this allocator tries to
> fill incomplete blocks instead of adding new ones, in many cases
> providing a compression ratio comparable to zsmalloc's.
> 
> zblock is also in most cases superior to zsmalloc with regard to
> average performance and worst execution times, thus allowing for better
> response time and real-time characteristics of the whole system.

Is there a reason not to use this allocation scheme in zsmalloc then?

I'm curious what others think, but I'm still not convinced a second
allocator makes sense. It's maintenance overhead, a permanent struggle
to match feature parity, and it fragments development and testing base.

Not long ago several slab allocators were removed for those
reasons. Likewise, we just deleted zbud and z3fold because they didn't
get any attention and bitrotted, but not before years of inflicting
pain through the zpool interface, users accidentally making very
suboptimal choices, reporting the same bugs over and over again etc.

If you discovered a better allocation scheme, that's excellent. But I
don't see why it warrants forking the entire allocator.

I also don't buy the fragmentation argument. Even if you are better at
packing during allocation time (although, citation needed), the
workload can unmap swap pages such that they leave partial blocks
behind indefinitely if you don't have block compaction.

Then there is the migration support, which you said is planned, but
which would presumably require the same indirection between handle and
the backing pages that zsmalloc has. How much will this eat into the
performance advantage?

I'd much rather you'd focus on making zsmalloc better. Improve the
packing scheme, make expensive features optional/configurable etc.
That would be easier on developers and users alike.

Patch

diff --git a/Documentation/mm/zblock.rst b/Documentation/mm/zblock.rst
new file mode 100644
index 000000000000..9751434d0b76
--- /dev/null
+++ b/Documentation/mm/zblock.rst
@@ -0,0 +1,24 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+zblock
+======
+
+zblock is a special-purpose allocator for storing compressed pages.
+It stores an integer number of compressed objects per block. These
+blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
+
+With zblock, it is possible to densely arrange objects of various sizes,
+resulting in low internal fragmentation. The allocator also tries to
+fill incomplete blocks instead of adding new ones, in many cases
+providing a compression ratio substantially higher than z3fold and zbud
+(though lower than zsmalloc's).
+
+zblock does not require an MMU to operate and is also superior to zsmalloc
+with regard to average performance and worst-case execution times, thus
+allowing for better response times and real-time characteristics of the
+whole system.
+
+E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get a
+5-10% higher bogo ops/s value for zblock compared to zsmalloc.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 96b827049501..46465c986005 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26640,6 +26640,13 @@  F:	Documentation/networking/device_drivers/hamradio/z8530drv.rst
 F:	drivers/net/hamradio/*scc.c
 F:	drivers/net/hamradio/z8530.h
 
+ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
+M:	Vitaly Wool <vitaly.wool@konsulko.se>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	Documentation/mm/zblock.rst
+F:	mm/zblock.[ch]
+
 ZD1211RW WIRELESS DRIVER
 L:	linux-wireless@vger.kernel.org
 S:	Orphan
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..5aa1479151ec 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -152,6 +152,18 @@  config ZSWAP_ZPOOL_DEFAULT
        default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
        default ""
 
+config ZBLOCK
+	tristate "Fast compression allocator with high density"
+	depends on ZPOOL
+	help
+	  A special-purpose allocator for storing compressed pages.
+	  It stores an integer number of same-size compressed objects
+	  per block. These blocks consist of several physical pages
+	  (2**n, i.e. 1/2/4/8).
+
+	  With zblock, it is possible to densely arrange objects of
+	  various sizes resulting in low internal fragmentation.
+
 config ZSMALLOC
 	tristate
 	prompt "N:1 compression allocator (zsmalloc)" if (ZSWAP || ZRAM)
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..9d7e5b5bb694 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -116,6 +116,7 @@  obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZPOOL)	+= zpool.o
+obj-$(CONFIG_ZBLOCK)	+= zblock.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
diff --git a/mm/zblock.c b/mm/zblock.c
new file mode 100644
index 000000000000..ecc7aeb611af
--- /dev/null
+++ b/mm/zblock.c
@@ -0,0 +1,443 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * zblock.c
+ *
+ * Author: Vitaly Wool <vitaly.wool@konsulko.se>
+ * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
+ * Copyright (C) 2022-2025, Konsulko AB.
+ *
+ * Zblock is a small object allocator with the intention to serve as a
+ * zpool backend. It operates on page blocks which consist of a
+ * power-of-2 number of physical pages and store an integer number of
+ * compressed pages per block, which results in determinism and simplicity.
+ *
+ * zblock doesn't export any API and is meant to be used via the zpool API.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/preempt.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/zpool.h>
+#include "zblock.h"
+
+static struct rb_root block_desc_tree = RB_ROOT;
+
+/* Encode handle of a particular slot in the pool using metadata */
+static inline unsigned long metadata_to_handle(struct zblock_block *block,
+				unsigned int block_type, unsigned int slot)
+{
+	return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
+}
+
+/* Return block, block type and slot in the pool corresponding to handle */
+static inline struct zblock_block *handle_to_metadata(unsigned long handle,
+				unsigned int *block_type, unsigned int *slot)
+{
+	*block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
+	*slot = handle & SLOT_MASK;
+	return (struct zblock_block *)(handle & PAGE_MASK);
+}
+
+/*
+ * Find a block with at least one free slot and claim it.
+ * We make sure that the first block, if it exists, will always work.
+ */
+static inline struct zblock_block *find_and_claim_block(struct block_list *b,
+		int block_type, unsigned long *handle)
+{
+	struct list_head *l = &b->active_list;
+	unsigned int slot;
+
+	if (!list_empty(l)) {
+		struct zblock_block *z = list_first_entry(l, typeof(*z), link);
+
+		if (--z->free_slots == 0)
+			list_move(&z->link, &b->full_list);
+		/*
+		 * There is a slot in the block and we just made sure it would
+		 * remain.
+		 * Find that slot and set the busy bit.
+		 */
+		for (slot = find_first_zero_bit(z->slot_info,
+					block_desc[block_type].slots_per_block);
+		     slot < block_desc[block_type].slots_per_block;
+		     slot = find_next_zero_bit(z->slot_info,
+					block_desc[block_type].slots_per_block,
+					slot)) {
+			if (!test_and_set_bit(slot, z->slot_info))
+				break;
+			barrier();
+		}
+
+		WARN_ON(slot >= block_desc[block_type].slots_per_block);
+		*handle = metadata_to_handle(z, block_type, slot);
+		return z;
+	}
+	return NULL;
+}
+
+/*
+ * Allocate a new block and add it to the corresponding block list.
+ */
+static struct zblock_block *alloc_block(struct zblock_pool *pool,
+					int block_type, gfp_t gfp,
+					unsigned long *handle)
+{
+	struct zblock_block *block;
+	struct block_list *block_list;
+
+	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
+	if (!block)
+		return NULL;
+
+	block_list = &pool->block_lists[block_type];
+
+	/* init block data  */
+	block->free_slots = block_desc[block_type].slots_per_block - 1;
+	memset(&block->slot_info, 0, sizeof(block->slot_info));
+	set_bit(0, block->slot_info);
+	*handle = metadata_to_handle(block, block_type, 0);
+
+	spin_lock(&block_list->lock);
+	list_add(&block->link, &block_list->active_list);
+	block_list->block_count++;
+	spin_unlock(&block_list->lock);
+	return block;
+}
+
+/*****************
+ * API Functions
+ *****************/
+/**
+ * zblock_create_pool() - create a new zblock pool
+ * @gfp:	gfp flags when allocating the zblock pool structure
+ * @ops:	user-defined operations for the zblock pool
+ *
+ * Return: pointer to the new zblock pool or NULL if the metadata allocation
+ * failed.
+ */
+static struct zblock_pool *zblock_create_pool(gfp_t gfp)
+{
+	struct zblock_pool *pool;
+	struct block_list *block_list;
+	int i;
+
+	pool = kmalloc(sizeof(struct zblock_pool), gfp);
+	if (!pool)
+		return NULL;
+
+	/* init each block list */
+	for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
+		block_list = &pool->block_lists[i];
+		spin_lock_init(&block_list->lock);
+		INIT_LIST_HEAD(&block_list->full_list);
+		INIT_LIST_HEAD(&block_list->active_list);
+		block_list->block_count = 0;
+	}
+	return pool;
+}
+
+/**
+ * zblock_destroy_pool() - destroys an existing zblock pool
+ * @pool:	the zblock pool to be destroyed
+ *
+ */
+static void zblock_destroy_pool(struct zblock_pool *pool)
+{
+	kfree(pool);
+}
+
+
+/**
+ * zblock_alloc() - allocates a slot of appropriate size
+ * @pool:	zblock pool from which to allocate
+ * @size:	size in bytes of the desired allocation
+ * @gfp:	gfp flags used if the pool needs to grow
+ * @handle:	handle of the new allocation
+ *
+ * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
+ * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
+ * a new slot.
+ */
+static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
+			unsigned long *handle)
+{
+	int block_type = -1;
+	struct zblock_block *block;
+	struct block_list *block_list;
+
+	if (!size)
+		return -EINVAL;
+
+	if (size > PAGE_SIZE)
+		return -ENOSPC;
+
+	/* find basic block type with suitable slot size */
+	if (size < block_desc[0].slot_size)
+		block_type = 0;
+	else {
+		struct block_desc_node *block_node;
+		struct rb_node *node = block_desc_tree.rb_node;
+
+		while (node) {
+			block_node = container_of(node, typeof(*block_node), node);
+			if (size < block_node->this_slot_size)
+				node = node->rb_left;
+			else if (size >= block_node->next_slot_size)
+				node = node->rb_right;
+			else {
+				block_type = block_node->block_idx + 1;
+				break;
+			}
+		}
+	}
+	if (WARN_ON(block_type < 0))
+		return -EINVAL;
+	if (block_type >= ARRAY_SIZE(block_desc))
+		return -ENOSPC;
+
+	block_list = &pool->block_lists[block_type];
+
+	spin_lock(&block_list->lock);
+	block = find_and_claim_block(block_list, block_type, handle);
+	spin_unlock(&block_list->lock);
+	if (block)
+		return 0;
+
+	/* no block with free slots found, try to allocate a new empty block */
+	block = alloc_block(pool, block_type, gfp & ~(__GFP_MOVABLE | __GFP_HIGHMEM), handle);
+	return block ? 0 : -ENOMEM;
+}
+
+/**
+ * zblock_free() - frees the allocation associated with the given handle
+ * @pool:	pool in which the allocation resided
+ * @handle:	handle associated with the allocation returned by zblock_alloc()
+ *
+ */
+static void zblock_free(struct zblock_pool *pool, unsigned long handle)
+{
+	unsigned int slot, block_type;
+	struct zblock_block *block;
+	struct block_list *block_list;
+
+	block = handle_to_metadata(handle, &block_type, &slot);
+	block_list = &pool->block_lists[block_type];
+
+	spin_lock(&block_list->lock);
+	/* if all slots in the block are empty, delete the whole block */
+	if (++block->free_slots == block_desc[block_type].slots_per_block) {
+		block_list->block_count--;
+		list_del(&block->link);
+		spin_unlock(&block_list->lock);
+		free_pages((unsigned long)block, block_desc[block_type].order);
+		return;
+	} else if (block->free_slots == 1)
+		list_move_tail(&block->link, &block_list->active_list);
+	clear_bit(slot, block->slot_info);
+	spin_unlock(&block_list->lock);
+}
+
+/**
+ * zblock_map() - maps the allocation associated with the given handle
+ * @pool:	pool in which the allocation resides
+ * @handle:	handle associated with the allocation to be mapped
+ *
+ *
+ * Returns: a pointer to the mapped allocation
+ */
+static void *zblock_map(struct zblock_pool *pool, unsigned long handle)
+{
+	unsigned int block_type, slot;
+	struct zblock_block *block;
+	unsigned long offs;
+	void *p;
+
+	block = handle_to_metadata(handle, &block_type, &slot);
+	offs = ZBLOCK_HEADER_SIZE + slot * block_desc[block_type].slot_size;
+	p = (void *)block + offs;
+	return p;
+}
+
+/**
+ * zblock_unmap() - unmaps the allocation associated with the given handle
+ * @pool:	pool in which the allocation resides
+ * @handle:	handle associated with the allocation to be unmapped
+ */
+static void zblock_unmap(struct zblock_pool *pool, unsigned long handle)
+{
+}
+
+/**
+ * zblock_write() - write to the memory area defined by handle
+ * @pool:	pool in which the allocation resides
+ * @handle:	handle associated with the allocation
+ * @handle_mem: pointer to source memory block
+ * @mem_len:	length of the memory block to write
+ */
+static void zblock_write(struct zblock_pool *pool, unsigned long handle,
+			 void *handle_mem, size_t mem_len)
+{
+	unsigned int block_type, slot;
+	struct zblock_block *block;
+	unsigned long offs;
+	void *p;
+
+	block = handle_to_metadata(handle, &block_type, &slot);
+	offs = ZBLOCK_HEADER_SIZE + slot * block_desc[block_type].slot_size;
+	p = (void *)block + offs;
+	memcpy(p, handle_mem, mem_len);
+}
+
+/**
+ * zblock_get_total_pages() - gets the zblock pool size in pages
+ * @pool:	pool being queried
+ *
+ * Returns: size in pages of the given pool.
+ */
+static u64 zblock_get_total_pages(struct zblock_pool *pool)
+{
+	u64 total_size;
+	int i;
+
+	total_size = 0;
+	for (i = 0; i < ARRAY_SIZE(block_desc); i++)
+		total_size += pool->block_lists[i].block_count << block_desc[i].order;
+
+	return total_size;
+}
+
+/*****************
+ * zpool
+ ****************/
+
+static void *zblock_zpool_create(const char *name, gfp_t gfp)
+{
+	return zblock_create_pool(gfp);
+}
+
+static void zblock_zpool_destroy(void *pool)
+{
+	zblock_destroy_pool(pool);
+}
+
+static int zblock_zpool_malloc(void *pool, size_t size, gfp_t gfp,
+			unsigned long *handle)
+{
+	return zblock_alloc(pool, size, gfp, handle);
+}
+
+static void zblock_zpool_free(void *pool, unsigned long handle)
+{
+	zblock_free(pool, handle);
+}
+
+static void *zblock_zpool_read_begin(void *pool, unsigned long handle,
+				void *local_copy)
+{
+	return zblock_map(pool, handle);
+}
+
+static void zblock_zpool_obj_write(void *pool, unsigned long handle,
+				void *handle_mem, size_t mem_len)
+{
+	zblock_write(pool, handle, handle_mem, mem_len);
+}
+
+static void zblock_zpool_read_end(void *pool, unsigned long handle,
+				void *handle_mem)
+{
+	zblock_unmap(pool, handle);
+}
+
+static u64 zblock_zpool_total_pages(void *pool)
+{
+	return zblock_get_total_pages(pool);
+}
+
+static struct zpool_driver zblock_zpool_driver = {
+	.type =			"zblock",
+	.owner =		THIS_MODULE,
+	.create =		zblock_zpool_create,
+	.destroy =		zblock_zpool_destroy,
+	.malloc =		zblock_zpool_malloc,
+	.free =			zblock_zpool_free,
+	.obj_read_begin =	zblock_zpool_read_begin,
+	.obj_read_end =		zblock_zpool_read_end,
+	.obj_write =		zblock_zpool_obj_write,
+	.total_pages =		zblock_zpool_total_pages,
+};
+
+MODULE_ALIAS("zpool-zblock");
+
+static void delete_rbtree(void)
+{
+	while (!RB_EMPTY_ROOT(&block_desc_tree))
+		rb_erase(block_desc_tree.rb_node, &block_desc_tree);
+}
+
+static int __init create_rbtree(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
+		struct block_desc_node *block_node = kmalloc(sizeof(*block_node),
+							GFP_KERNEL);
+		struct rb_node **new = &block_desc_tree.rb_node, *parent = NULL;
+
+		if (!block_node) {
+			delete_rbtree();
+			return -ENOMEM;
+		}
+		if (i > 0 && block_desc[i].slot_size <= block_desc[i-1].slot_size) {
+			pr_err("%s: block descriptors not in ascending order\n",
+				__func__);
+			delete_rbtree();
+			return -EINVAL;
+		}
+		block_node->this_slot_size = block_desc[i].slot_size;
+		block_node->block_idx = i;
+		if (i == ARRAY_SIZE(block_desc) - 1)
+			block_node->next_slot_size = PAGE_SIZE;
+		else
+			block_node->next_slot_size = block_desc[i+1].slot_size;
+		while (*new) {
+			parent = *new;
+			/* the array is sorted so we will always go to the right */
+			new = &((*new)->rb_right);
+		}
+		rb_link_node(&block_node->node, parent, new);
+		rb_insert_color(&block_node->node, &block_desc_tree);
+	}
+	return 0;
+}
+
+static int __init init_zblock(void)
+{
+	int ret = create_rbtree();
+
+	if (ret)
+		return ret;
+
+	zpool_register_driver(&zblock_zpool_driver);
+	return 0;
+}
+
+static void __exit exit_zblock(void)
+{
+	zpool_unregister_driver(&zblock_zpool_driver);
+	delete_rbtree();
+}
+
+module_init(init_zblock);
+module_exit(exit_zblock);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vitaly Wool <vitaly.wool@konsulko.se>");
+MODULE_DESCRIPTION("Block allocator for compressed pages");
diff --git a/mm/zblock.h b/mm/zblock.h
new file mode 100644
index 000000000000..fd72961c077a
--- /dev/null
+++ b/mm/zblock.h
@@ -0,0 +1,176 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Author: Vitaly Wool <vitaly.wool@konsulko.com>
+ * Copyright (C) 2025, Konsulko AB.
+ */
+#ifndef __ZBLOCK_H__
+#define __ZBLOCK_H__
+
+#include <linux/mm.h>
+#include <linux/rbtree.h>
+#include <linux/types.h>
+
+#define SLOT_FREE 0
+#define BIT_SLOT_OCCUPIED 0
+#define BIT_SLOT_MAPPED 1
+
+#if PAGE_SIZE == 0x1000
+/* max 128 slots per block, max table size 32 */
+#define SLOT_BITS 7
+#elif PAGE_SIZE == 0x4000
+/* max 256 slots per block, max table size 64 */
+#define SLOT_BITS 8
+#else
+#error Unsupported PAGE_SIZE
+#endif
+
+#define MAX_SLOTS (1 << SLOT_BITS)
+#define SLOT_MASK ((0x1UL << SLOT_BITS) - 1)
+
+#define ZBLOCK_HEADER_SIZE	round_up(sizeof(struct zblock_block), sizeof(long))
+#define BLOCK_DATA_SIZE(order) ((PAGE_SIZE << order) - ZBLOCK_HEADER_SIZE)
+#define SLOT_SIZE(nslots, order) (round_down((BLOCK_DATA_SIZE(order) / nslots), sizeof(long)))
+
+/**
+ * struct zblock_block - block metadata
+ * A block consists of several (1/2/4/8) pages and contains a fixed
+ * integer number of slots for allocating compressed pages.
+ *
+ * free_slots:	number of free slots in the block
+ * slot_info:	contains data about free/occupied slots
+ */
+struct zblock_block {
+	struct list_head link;
+	DECLARE_BITMAP(slot_info, 1 << SLOT_BITS);
+	u32 free_slots;
+};
+
+/**
+ * struct block_desc - general metadata for block lists
+ * Each block list stores only blocks of the corresponding type, which means
+ * that all blocks in it have the same number and size of slots.
+ * All slots are aligned to size of long.
+ *
+ * slot_size:		size of slot for this list
+ * slots_per_block:	number of slots per block for this list
+ * order:		order for __get_free_pages
+ */
+struct block_desc {
+	unsigned int slot_size;
+	unsigned short slots_per_block;
+	unsigned short order;
+};
+
+struct block_desc_node {
+	struct rb_node node;
+	unsigned int this_slot_size;
+	unsigned int next_slot_size;
+	unsigned int block_idx;
+};
+
+static const struct block_desc block_desc[] = {
+#if PAGE_SIZE == 0x1000
+	{ SLOT_SIZE(63, 0), 63, 0 },
+	{ SLOT_SIZE(32, 0), 32, 0 },
+	{ SLOT_SIZE(21, 0), 21, 0 },
+	{ SLOT_SIZE(15, 0), 15, 0 },
+	{ SLOT_SIZE(12, 0), 12, 0 },
+	{ SLOT_SIZE(10, 0), 10, 0 },
+	{ SLOT_SIZE(9, 0), 9, 0 },
+	{ SLOT_SIZE(8, 0), 8, 0 },
+	{ SLOT_SIZE(29, 2), 29, 2 },
+	{ SLOT_SIZE(13, 1), 13, 1 },
+	{ SLOT_SIZE(6, 0), 6, 0 },
+	{ SLOT_SIZE(11, 1), 11, 1 },
+	{ SLOT_SIZE(5, 0), 5, 0 },
+	{ SLOT_SIZE(9, 1), 9, 1 },
+	{ SLOT_SIZE(8, 1), 8, 1 },
+	{ SLOT_SIZE(29, 3), 29, 3 },
+	{ SLOT_SIZE(13, 2), 13, 2 },
+	{ SLOT_SIZE(12, 2), 12, 2 },
+	{ SLOT_SIZE(11, 2), 11, 2 },
+	{ SLOT_SIZE(10, 2), 10, 2 },
+	{ SLOT_SIZE(9, 2), 9, 2 },
+	{ SLOT_SIZE(17, 3), 17, 3 },
+	{ SLOT_SIZE(8, 2), 8, 2 },
+	{ SLOT_SIZE(15, 3), 15, 3 },
+	{ SLOT_SIZE(14, 3), 14, 3 },
+	{ SLOT_SIZE(13, 3), 13, 3 },
+	{ SLOT_SIZE(6, 2), 6, 2 },
+	{ SLOT_SIZE(11, 3), 11, 3 },
+	{ SLOT_SIZE(10, 3), 10, 3 },
+	{ SLOT_SIZE(9, 3), 9, 3 },
+	{ SLOT_SIZE(4, 2), 4, 2 },
+#elif PAGE_SIZE == 0x4000
+	{ SLOT_SIZE(255, 0), 255, 0 },
+	{ SLOT_SIZE(185, 0), 185, 0 },
+	{ SLOT_SIZE(145, 0), 145, 0 },
+	{ SLOT_SIZE(113, 0), 113, 0 },
+	{ SLOT_SIZE(92, 0), 92, 0 },
+	{ SLOT_SIZE(75, 0), 75, 0 },
+	{ SLOT_SIZE(60, 0), 60, 0 },
+	{ SLOT_SIZE(51, 0), 51, 0 },
+	{ SLOT_SIZE(43, 0), 43, 0 },
+	{ SLOT_SIZE(37, 0), 37, 0 },
+	{ SLOT_SIZE(32, 0), 32, 0 },
+	{ SLOT_SIZE(27, 0), 27, 0 },
+	{ SLOT_SIZE(23, 0), 23, 0 },
+	{ SLOT_SIZE(19, 0), 19, 0 },
+	{ SLOT_SIZE(17, 0), 17, 0 },
+	{ SLOT_SIZE(15, 0), 15, 0 },
+	{ SLOT_SIZE(13, 0), 13, 0 },
+	{ SLOT_SIZE(11, 0), 11, 0 },
+	{ SLOT_SIZE(10, 0), 10, 0 },
+	{ SLOT_SIZE(9, 0), 9, 0 },
+	{ SLOT_SIZE(8, 0), 8, 0 },
+	{ SLOT_SIZE(15, 1), 15, 1 },
+	{ SLOT_SIZE(14, 1), 14, 1 },
+	{ SLOT_SIZE(13, 1), 13, 1 },
+	{ SLOT_SIZE(12, 1), 12, 1 },
+	{ SLOT_SIZE(11, 1), 11, 1 },
+	{ SLOT_SIZE(10, 1), 10, 1 },
+	{ SLOT_SIZE(9, 1), 9, 1 },
+	{ SLOT_SIZE(8, 1), 8, 1 },
+	{ SLOT_SIZE(15, 2), 15, 2 },
+	{ SLOT_SIZE(14, 2), 14, 2 },
+	{ SLOT_SIZE(13, 2), 13, 2 },
+	{ SLOT_SIZE(12, 2), 12, 2 },
+	{ SLOT_SIZE(11, 2), 11, 2 },
+	{ SLOT_SIZE(10, 2), 10, 2 },
+	{ SLOT_SIZE(9, 2), 9, 2 },
+	{ SLOT_SIZE(8, 2), 8, 2 },
+	{ SLOT_SIZE(7, 2), 7, 2 },
+	{ SLOT_SIZE(6, 2), 6, 2 },
+	{ SLOT_SIZE(5, 2), 5, 2 },
+#endif /* PAGE_SIZE */
+};
+
+/**
+ * struct block_list - stores metadata of particular list
+ * lock:		protects the list of blocks
+ * active_list:		linked list of active (non-full) blocks
+ * full_list:		linked list of full blocks
+ * block_count:		total number of blocks in the list
+ */
+struct block_list {
+	spinlock_t lock;
+	struct list_head active_list;
+	struct list_head full_list;
+	unsigned long block_count;
+};
+
+/**
+ * struct zblock_pool - stores metadata for each zblock pool
+ * @block_lists:	array of block lists
+ * @zpool:		zpool driver
+ *
+ * This structure is allocated at pool creation time and maintains metadata
+ * for a particular zblock pool.
+ */
+struct zblock_pool {
+	struct block_list block_lists[ARRAY_SIZE(block_desc)];
+	struct zpool *zpool;
+};
+
+
+#endif