
mm: add zblock allocator

Message ID 20250401171754.2686501-1-vitaly.wool@konsulko.se (mailing list archive)
State New
Series mm: add zblock allocator

Commit Message

Vitaly April 1, 2025, 5:17 p.m. UTC
zblock is a special-purpose allocator for storing compressed pages.
It stores an integer number of compressed objects per block. These
blocks consist of several physical pages (2**n, i.e. 1/2/4/8).

With zblock, it is possible to densely arrange objects of various sizes
resulting in low internal fragmentation. Also, this allocator tries to
fill incomplete blocks instead of adding new ones, in many cases
providing a compression ratio substantially higher than z3fold and zbud
(though lower than zsmalloc's).

zblock does not require an MMU to operate and is also superior to
zsmalloc with regard to average performance and worst-case execution
times, thus allowing for better response times and real-time
characteristics of the whole system.

E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
5-10% higher bogo ops/s values in a zblock/zsmalloc comparison.

Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Igor Belousov <igor.b@beldev.am>
---
 Documentation/mm/zblock.rst |  22 ++
 MAINTAINERS                 |   7 +
 mm/Kconfig                  |   8 +
 mm/Makefile                 |   1 +
 mm/zblock.c                 | 492 ++++++++++++++++++++++++++++++++++++
 5 files changed, 530 insertions(+)
 create mode 100644 Documentation/mm/zblock.rst
 create mode 100644 mm/zblock.c
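
As a rough illustration of the block geometry described above, the following
userspace sketch mirrors the SLOT_SIZE() arithmetic from mm/zblock.c below
(the 4 KiB page size and the 24-byte rounded-up block header are assumptions
made for the example; both depend on architecture and configuration):

#include <stdio.h>

#define PAGE_SIZE_EX	4096UL	/* assumed 4 KiB pages */
#define HEADER_EX	24UL	/* assumed round_up(sizeof(struct zblock_block), sizeof(long)) */

#define BLOCK_DATA_EX(order)		((PAGE_SIZE_EX << (order)) - HEADER_EX)
#define SLOT_SIZE_EX(nslots, order)	\
	(BLOCK_DATA_EX(order) / (nslots) / sizeof(long) * sizeof(long))

int main(void)
{
	/* the { SLOT_SIZE(32, 0), 32, 0 } entry: 32 slots in a single page */
	printf("order 0, 32 slots -> %lu bytes per slot\n",
	       (unsigned long)SLOT_SIZE_EX(32, 0));
	/* the { SLOT_SIZE(14, 1), 14, 1 } entry: 14 slots in a 2-page block */
	printf("order 1, 14 slots -> %lu bytes per slot\n",
	       (unsigned long)SLOT_SIZE_EX(14, 1));
	return 0;
}

An allocation request is routed to the first block_desc entry whose slot size
is large enough, so small compressed objects land in the densest blocks.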

Comments

Nhat Pham April 1, 2025, 6:24 p.m. UTC | #1
On Tue, Apr 1, 2025 at 10:18 AM Vitaly Wool <vitaly.wool@konsulko.se> wrote:
>
> zblock is a special purpose allocator for storing compressed pages.
> It stores integer number of compressed objects per its block. These
> blocks consist of several physical pages (2**n, i. e. 1/2/4/8).

Haven't taken a close look yet, but as a general principle I don't
mind having a separate allocator for a separate use case.

Some quick notes (will do a careful review later):

>
> With zblock, it is possible to densely arrange objects of various sizes
> resulting in low internal fragmentation. Also this allocator tries to
> fill incomplete blocks instead of adding new ones,  in many cases
> providing a compression ratio substantially higher than z3fold and zbud
> (though lower than zmalloc's).

Do we have data for comparison here?

>
> zblock does not require MMU to operate and also is superior to zsmalloc

This is not an actual meaningful distinction. CONFIG_SWAP depends on CONFIG_MMU:

menuconfig SWAP
    bool "Support for paging of anonymous memory (swap)"
    depends on MMU && BLOCK && !ARCH_NO_SWAP


> with regard to average performance and worst execution times, thus
> allowing for better response time and real-time characteristics of the
> whole system.

By performance, do you mean latency or throughput or storage density?

>
> E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
> 5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
>
> Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
> Signed-off-by: Igor Belousov <igor.b@beldev.am>
> ---
>  Documentation/mm/zblock.rst |  22 ++
>  MAINTAINERS                 |   7 +
>  mm/Kconfig                  |   8 +
>  mm/Makefile                 |   1 +
>  mm/zblock.c                 | 492 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 530 insertions(+)
>  create mode 100644 Documentation/mm/zblock.rst
>  create mode 100644 mm/zblock.c
>
> diff --git a/Documentation/mm/zblock.rst b/Documentation/mm/zblock.rst
> new file mode 100644
> index 000000000000..754b3dbb9e94
> --- /dev/null
> +++ b/Documentation/mm/zblock.rst
> @@ -0,0 +1,22 @@
> +======
> +zblock
> +======
> +
> +zblock is a special purpose allocator for storing compressed pages.
> +It stores integer number of compressed objects per its block. These
> +blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> +
> +With zblock, it is possible to densely arrange objects of various sizes
> +resulting in low internal fragmentation. Also this allocator tries to
> +fill incomplete blocks instead of adding new ones,  in many cases
> +providing a compression ratio substantially higher than z3fold and zbud
> +(though lower than zmalloc's).
> +
> +zblock does not require MMU to operate and also is superior to zsmalloc

Same note as above.

> +with regard to average performance and worst execution times, thus
> +allowing for better response time and real-time characteristics of the
> +whole system.
> +
> +E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
> +5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
> +
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 991a33bad10e..166e9bfa04dc 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26313,6 +26313,13 @@ F:     Documentation/networking/device_drivers/hamradio/z8530drv.rst
>  F:     drivers/net/hamradio/*scc.c
>  F:     drivers/net/hamradio/z8530.h
>
> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
> +M:     Vitaly Wool <vitaly.wool@konsulko.se>
> +L:     linux-mm@kvack.org
> +S:     Maintained
> +F:     Documentation/mm/zblock.rst
> +F:     mm/zblock.c
> +
>  ZBUD COMPRESSED PAGE ALLOCATOR
>  M:     Seth Jennings <sjenning@redhat.com>
>  M:     Dan Streetman <ddstreet@ieee.org>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1b501db06417..26b79e3c1300 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -193,6 +193,14 @@ config Z3FOLD_DEPRECATED
>           page. It is a ZBUD derivative so the simplicity and determinism are
>           still there.
>
> +config ZBLOCK
> +       tristate "Fast compression allocator with high density"
> +       depends on ZPOOL
> +       help
> +         A special purpose allocator for storing compressed pages.
> +         It is designed to store same size compressed pages in blocks of
> +         physical pages.
> +
>  config Z3FOLD
>         tristate
>         default y if Z3FOLD_DEPRECATED=y
> diff --git a/mm/Makefile b/mm/Makefile
> index 850386a67b3e..2018455b7baa 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -116,6 +116,7 @@ obj-$(CONFIG_ZPOOL) += zpool.o
>  obj-$(CONFIG_ZBUD)     += zbud.o
>  obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
>  obj-$(CONFIG_Z3FOLD)   += z3fold.o
> +obj-$(CONFIG_ZBLOCK)   += zblock.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_NUMA) += numa.o
> diff --git a/mm/zblock.c b/mm/zblock.c
> new file mode 100644
> index 000000000000..a6778653c451
> --- /dev/null
> +++ b/mm/zblock.c
> @@ -0,0 +1,492 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * zblock.c
> + *
> + * Author: Vitaly Wool <vitaly.wool@konsulko.com>
> + * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
> + * Copyright (C) 2022-2024, Konsulko AB.
> + *
> + * Zblock is a small object allocator with the intention to serve as a
> + * zpool backend. It operates on page blocks which consist of number
> + * of physical pages being a power of 2 and store integer number of
> + * compressed pages per block which results in determinism and simplicity.
> + *
> + * zblock doesn't export any API and is meant to be used via zpool API.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/atomic.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/preempt.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/zpool.h>
> +
> +#define SLOT_FREE 0
> +#define BIT_SLOT_OCCUPIED 0
> +#define BIT_SLOT_MAPPED 1
> +
> +#define SLOT_BITS 5
> +#define MAX_SLOTS (1 << SLOT_BITS)
> +#define SLOT_MASK ((0x1UL << SLOT_BITS) - 1)
> +
> +#define ZBLOCK_HEADER_SIZE     round_up(sizeof(struct zblock_block), sizeof(long))
> +#define BLOCK_DATA_SIZE(order) ((PAGE_SIZE << order) - ZBLOCK_HEADER_SIZE)
> +#define SLOT_SIZE(nslots, order) (round_down((BLOCK_DATA_SIZE(order) / nslots), sizeof(long)))
> +
> +#define BLOCK_CACHE_SIZE 32
> +
> +struct zblock_pool;
> +
> +/**
> + * struct zblock_block - block metadata
> + * Block consists of several (1/2/4/8) pages and contains fixed
> + * integer number of slots for allocating compressed pages.
> + *
> + * free_slots: number of free slots in the block
> + * slot_info:  contains data about free/occupied slots
> + * cache_idx:  index of the block in cache
> + */
> +struct zblock_block {
> +       atomic_t free_slots;
> +       u64 slot_info[1];
> +       int cache_idx;
> +};
> +
> +/**
> + * struct block_desc - general metadata for block lists
> + * Each block list stores only blocks of corresponding type which means
> + * that all blocks in it have the same number and size of slots.
> + * All slots are aligned to size of long.
> + *
> + * slot_size:          size of slot for this list
> + * slots_per_block:    number of slots per block for this list
> + * order:              order for __get_free_pages
> + */
> +static const struct block_desc {
> +       const unsigned int slot_size;
> +       const unsigned short slots_per_block;
> +       const unsigned short order;
> +} block_desc[] = {
> +       { SLOT_SIZE(32, 0), 32, 0 },
> +       { SLOT_SIZE(22, 0), 22, 0 },
> +       { SLOT_SIZE(17, 0), 17, 0 },
> +       { SLOT_SIZE(13, 0), 13, 0 },
> +       { SLOT_SIZE(11, 0), 11, 0 },
> +       { SLOT_SIZE(9, 0), 9, 0 },
> +       { SLOT_SIZE(8, 0), 8, 0 },
> +       { SLOT_SIZE(14, 1), 14, 1 },
> +       { SLOT_SIZE(12, 1), 12, 1 },
> +       { SLOT_SIZE(11, 1), 11, 1 },
> +       { SLOT_SIZE(10, 1), 10, 1 },
> +       { SLOT_SIZE(9, 1), 9, 1 },
> +       { SLOT_SIZE(8, 1), 8, 1 },
> +       { SLOT_SIZE(15, 2), 15, 2 },
> +       { SLOT_SIZE(14, 2), 14, 2 },
> +       { SLOT_SIZE(13, 2), 13, 2 },
> +       { SLOT_SIZE(12, 2), 12, 2 },
> +       { SLOT_SIZE(11, 2), 11, 2 },
> +       { SLOT_SIZE(10, 2), 10, 2 },
> +       { SLOT_SIZE(9, 2), 9, 2 },
> +       { SLOT_SIZE(8, 2), 8, 2 },
> +       { SLOT_SIZE(15, 3), 15, 3 },
> +       { SLOT_SIZE(14, 3), 14, 3 },
> +       { SLOT_SIZE(13, 3), 13, 3 },
> +       { SLOT_SIZE(12, 3), 12, 3 },
> +       { SLOT_SIZE(11, 3), 11, 3 },
> +       { SLOT_SIZE(10, 3), 10, 3 },
> +       { SLOT_SIZE(9, 3), 9, 3 },
> +       { SLOT_SIZE(7, 3), 7, 3 }
> +};
> +
> +/**
> + * struct block_list - stores metadata of particular list
> + * lock:               protects block_cache
> + * block_cache:                blocks with free slots
> + * block_count:                total number of blocks in the list
> + */
> +struct block_list {
> +       spinlock_t lock;
> +       struct zblock_block *block_cache[BLOCK_CACHE_SIZE];
> +       unsigned long block_count;
> +};
> +
> +/**
> + * struct zblock_pool - stores metadata for each zblock pool
> + * @block_lists:       array of block lists
> + * @zpool:             zpool driver
> + * @alloc_flag:                protects block allocation from memory leak
> + *
> + * This structure is allocated at pool creation time and maintains metadata
> + * for a particular zblock pool.
> + */
> +struct zblock_pool {
> +       struct block_list block_lists[ARRAY_SIZE(block_desc)];
> +       struct zpool *zpool;
> +       atomic_t alloc_flag;
> +};
> +
> +/*****************
> + * Helpers
> + *****************/
> +
> +static int cache_insert_block(struct zblock_block *block, struct block_list *list)
> +{
> +       unsigned int i, min_free_slots = atomic_read(&block->free_slots);
> +       int min_index = -1;
> +
> +       if (WARN_ON(block->cache_idx != -1))
> +               return -EINVAL;
> +
> +       min_free_slots = atomic_read(&block->free_slots);
> +       for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
> +               if (!list->block_cache[i] || !atomic_read(&(list->block_cache[i])->free_slots)) {
> +                       min_index = i;
> +                       break;
> +               }
> +               if (atomic_read(&(list->block_cache[i])->free_slots) < min_free_slots) {
> +                       min_free_slots = atomic_read(&(list->block_cache[i])->free_slots);
> +                       min_index = i;
> +               }
> +       }
> +       if (min_index >= 0) {
> +               if (list->block_cache[min_index])
> +                       (list->block_cache[min_index])->cache_idx = -1;
> +               list->block_cache[min_index] = block;
> +               block->cache_idx = min_index;
> +       }
> +       return min_index < 0 ? min_index : 0;
> +}
> +
> +static struct zblock_block *cache_find_block(struct block_list *list)
> +{
> +       int i;
> +       struct zblock_block *z = NULL;
> +
> +       for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
> +               if (list->block_cache[i] &&
> +                   atomic_dec_if_positive(&list->block_cache[i]->free_slots) >= 0) {
> +                       z = list->block_cache[i];
> +                       break;
> +               }
> +       }
> +       return z;
> +}
> +
> +static int cache_remove_block(struct block_list *list, struct zblock_block *block)
> +{
> +       int idx = block->cache_idx;
> +
> +       block->cache_idx = -1;
> +       if (idx >= 0)
> +               list->block_cache[idx] = NULL;
> +       return idx < 0 ? idx : 0;
> +}
> +
> +/*
> + * Encodes the handle of a particular slot in the pool using metadata
> + */
> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
> +                                                       unsigned int block_type, unsigned int slot)
> +{
> +       return (unsigned long)(block) + (block_type << SLOT_BITS) + slot;
> +}
> +
> +/* Returns block, block type and slot in the pool corresponding to handle */
> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
> +                                               unsigned int *block_type, unsigned int *slot)
> +{
> +       *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
> +       *slot = handle & SLOT_MASK;
> +       return (struct zblock_block *)(handle & PAGE_MASK);
> +}
> +
> +
> +/*
> + * allocate new block and add it to corresponding block list
> + */
> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
> +                                       int block_type, gfp_t gfp,
> +                                       unsigned long *handle)
> +{
> +       struct zblock_block *block;
> +       struct block_list *list;
> +
> +       block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
> +       if (!block)
> +               return NULL;
> +
> +       list = &(pool->block_lists)[block_type];
> +
> +       /* init block data  */
> +       memset(&block->slot_info, 0, sizeof(block->slot_info));
> +       atomic_set(&block->free_slots, block_desc[block_type].slots_per_block - 1);
> +       block->cache_idx = -1;
> +       set_bit(BIT_SLOT_OCCUPIED, (unsigned long *)block->slot_info);
> +       *handle = metadata_to_handle(block, block_type, 0);
> +
> +       spin_lock(&list->lock);
> +       cache_insert_block(block, list);
> +       list->block_count++;
> +       spin_unlock(&list->lock);
> +       return block;
> +}
> +
> +/*****************
> + * API Functions
> + *****************/
> +/**
> + * zblock_create_pool() - create a new zblock pool
> + * @gfp:       gfp flags when allocating the zblock pool structure
> + * @ops:       user-defined operations for the zblock pool
> + *
> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
> + * failed.
> + */
> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
> +{
> +       struct zblock_pool *pool;
> +       struct block_list *list;
> +       int i, j;
> +
> +       pool = kmalloc(sizeof(struct zblock_pool), gfp);
> +       if (!pool)
> +               return NULL;
> +
> +       /* init each block list */
> +       for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
> +               list = &(pool->block_lists)[i];
> +               spin_lock_init(&list->lock);
> +               for (j = 0; j < BLOCK_CACHE_SIZE; j++)
> +                       list->block_cache[j] = NULL;
> +               list->block_count = 0;
> +       }
> +       atomic_set(&pool->alloc_flag, 0);
> +       return pool;
> +}
> +
> +/**
> + * zblock_destroy_pool() - destroys an existing zblock pool
> + * @pool:      the zblock pool to be destroyed
> + *
> + */
> +static void zblock_destroy_pool(struct zblock_pool *pool)
> +{
> +       kfree(pool);
> +}
> +
> +
> +/**
> + * zblock_alloc() - allocates a slot of appropriate size
> + * @pool:      zblock pool from which to allocate
> + * @size:      size in bytes of the desired allocation
> + * @gfp:       gfp flags used if the pool needs to grow
> + * @handle:    handle of the new allocation
> + *
> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
> + * a new slot.
> + */
> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
> +                       unsigned long *handle)
> +{
> +       unsigned int block_type, slot;
> +       struct zblock_block *block;
> +       struct block_list *list;
> +
> +       if (!size)
> +               return -EINVAL;
> +
> +       if (size > PAGE_SIZE)
> +               return -ENOSPC;
> +
> +       /* find basic block type with suitable slot size */
> +       for (block_type = 0; block_type < ARRAY_SIZE(block_desc); block_type++) {
> +               if (size <= block_desc[block_type].slot_size)
> +                       break;
> +       }
> +       list = &(pool->block_lists[block_type]);
> +
> +check:
> +       /* check if there are free slots in cache */
> +       spin_lock(&list->lock);
> +       block = cache_find_block(list);
> +       spin_unlock(&list->lock);
> +       if (block)
> +               goto found;
> +
> +       /* not found block with free slots try to allocate new empty block */
> +       if (atomic_cmpxchg(&pool->alloc_flag, 0, 1))
> +               goto check;
> +       block = alloc_block(pool, block_type, gfp, handle);
> +       atomic_set(&pool->alloc_flag, 0);
> +       if (block)
> +               return 0;
> +       return -ENOMEM;
> +
> +found:
> +       /* find the first free slot in block */
> +       for (slot = 0; slot < block_desc[block_type].slots_per_block; slot++) {
> +               if (!test_and_set_bit(slot*2 + BIT_SLOT_OCCUPIED,
> +                                    (unsigned long *)&block->slot_info))
> +                       break;
> +       }
> +       *handle = metadata_to_handle(block, block_type, slot);
> +       return 0;
> +}
> +
> +/**
> + * zblock_free() - frees the allocation associated with the given handle
> + * @pool:      pool in which the allocation resided
> + * @handle:    handle associated with the allocation returned by zblock_alloc()
> + *
> + */
> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
> +{
> +       unsigned int slot, block_type;
> +       struct zblock_block *block;
> +       struct block_list *list;
> +
> +       block = handle_to_metadata(handle, &block_type, &slot);
> +       list = &(pool->block_lists[block_type]);
> +
> +       spin_lock(&list->lock);
> +       /* if all slots in block are empty delete whole block */
> +       if (atomic_inc_return(&block->free_slots) == block_desc[block_type].slots_per_block) {
> +               list->block_count--;
> +               cache_remove_block(list, block);
> +               spin_unlock(&list->lock);
> +               free_pages((unsigned long)block, block_desc[block_type].order);
> +               return;
> +       }
> +
> +       if (atomic_read(&block->free_slots) < block_desc[block_type].slots_per_block/2
> +                       && block->cache_idx == -1)
> +               cache_insert_block(block, list);
> +       spin_unlock(&list->lock);
> +
> +       clear_bit(slot*2 + BIT_SLOT_OCCUPIED, (unsigned long *)block->slot_info);
> +}
> +
> +/**
> + * zblock_map() - maps the allocation associated with the given handle
> + * @pool:      pool in which the allocation resides
> + * @handle:    handle associated with the allocation to be mapped
> + *
> + *
> + * Returns: a pointer to the mapped allocation
> + */
> +static void *zblock_map(struct zblock_pool *pool, unsigned long handle)
> +{
> +       unsigned int block_type, slot;
> +       struct zblock_block *block;
> +       void *p;
> +
> +       block = handle_to_metadata(handle, &block_type, &slot);
> +       p = (void *)block + ZBLOCK_HEADER_SIZE + slot * block_desc[block_type].slot_size;
> +       return p;
> +}
> +
> +/**
> + * zblock_unmap() - unmaps the allocation associated with the given handle
> + * @pool:      pool in which the allocation resides
> + * @handle:    handle associated with the allocation to be unmapped
> + */
> +static void zblock_unmap(struct zblock_pool *pool, unsigned long handle)
> +{
> +}
> +
> +/**
> + * zblock_get_total_pages() - gets the zblock pool size in pages
> + * @pool:      pool being queried
> + *
> + * Returns: size in bytes of the given pool.
> + */
> +static u64 zblock_get_total_pages(struct zblock_pool *pool)
> +{
> +       u64 total_size;
> +       int i;
> +
> +       total_size = 0;
> +       for (i = 0; i < ARRAY_SIZE(block_desc); i++)
> +               total_size += pool->block_lists[i].block_count << block_desc[i].order;
> +
> +       return total_size;
> +}
> +
> +/*****************
> + * zpool
> + ****************/
> +
> +static void *zblock_zpool_create(const char *name, gfp_t gfp)
> +{
> +       return zblock_create_pool(gfp);
> +}
> +
> +static void zblock_zpool_destroy(void *pool)
> +{
> +       zblock_destroy_pool(pool);
> +}
> +
> +static int zblock_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> +                       unsigned long *handle)
> +{
> +       return zblock_alloc(pool, size, gfp, handle);
> +}
> +
> +static void zblock_zpool_free(void *pool, unsigned long handle)
> +{
> +       zblock_free(pool, handle);
> +}
> +
> +static void *zblock_zpool_map(void *pool, unsigned long handle,
> +                       enum zpool_mapmode mm)
> +{
> +       return zblock_map(pool, handle);
> +}
> +
> +static void zblock_zpool_unmap(void *pool, unsigned long handle)
> +{
> +       zblock_unmap(pool, handle);
> +}
> +
> +static u64 zblock_zpool_total_pages(void *pool)
> +{
> +       return zblock_get_total_pages(pool);
> +}
> +
> +static struct zpool_driver zblock_zpool_driver = {
> +       .type =         "zblock",
> +       .owner =        THIS_MODULE,
> +       .create =       zblock_zpool_create,
> +       .destroy =      zblock_zpool_destroy,
> +       .malloc =       zblock_zpool_malloc,
> +       .free =         zblock_zpool_free,
> +       .map =          zblock_zpool_map,
> +       .unmap =        zblock_zpool_unmap,
> +       .total_pages =  zblock_zpool_total_pages,
> +};
> +
> +MODULE_ALIAS("zpool-zblock");
> +
> +static int __init init_zblock(void)
> +{
> +       pr_info("loaded\n");
> +       zpool_register_driver(&zblock_zpool_driver);
> +       return 0;
> +}
> +
> +static void __exit exit_zblock(void)
> +{
> +       zpool_unregister_driver(&zblock_zpool_driver);
> +       pr_info("unloaded\n");
> +}
> +
> +module_init(init_zblock);
> +module_exit(exit_zblock);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Vitaly Wool <vitaly.wool@konsulko.com>");
> +MODULE_DESCRIPTION("Block allocator for compressed pages");
> --
> 2.39.2
>
>
Vitaly April 1, 2025, 9:44 p.m. UTC | #2
Hi Nhat,

On Tuesday, April 1, 2025 at 8:24:15 pm +02:00, Nhat Pham <nphamcs@gmail.com> wrote:
>> With zblock, it is possible to densely arrange objects of various sizes
>> resulting in low internal fragmentation. Also this allocator tries to
>> fill incomplete blocks instead of adding new ones, in many cases
>> providing a compression ratio substantially higher than z3fold and zbud
>> (though lower than zmalloc's).

> Do we have data for comparison here?

Yeah, for instance we normally run *stress-ng --vm 4 --vm-bytes 4G --vm-keep --timeout 10m --metrics-brief*

zblock:
stress-ng: metrc: [499] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: metrc: [499] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: metrc: [499] vm 43755994 600.53 2382.34 15.19 72861.92 18250.50

zsmalloc:
stress-ng: metrc: [491] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: metrc: [491] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: metrc: [491] vm 41769859 601.37 2381.85 16.56 69457.56 17415.62

which gives just a little short of a 5% advantage for zblock (72861.92 vs. 69457.56 bogo ops/s in real time).

>> with regard to average performance and worst execution times, thus
>> allowing for better response time and real-time characteristics of the
>> whole system.

> By performance, do you mean latency or throughput or storage density?

Latency and throughput. It's hard to compete with zsmalloc in storage density indeed.

~Vitaly
Shakeel Butt April 1, 2025, 11:16 p.m. UTC | #3
Hi Vitaly,

On Tue, Apr 01, 2025 at 07:17:54PM +0200, Vitaly Wool wrote:
> zblock is a special purpose allocator for storing compressed pages.
> It stores integer number of compressed objects per its block. These
> blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> 
> With zblock, it is possible to densely arrange objects of various sizes
> resulting in low internal fragmentation. Also this allocator tries to
> fill incomplete blocks instead of adding new ones,  in many cases
> providing a compression ratio substantially higher than z3fold and zbud
> (though lower than zmalloc's).
> 
> zblock does not require MMU

Can you explain why not requiring MMU is important for your use-case?
Also what exactly is your use-case? Are you planning to use zblock
through zram or zswap or something new?

> to operate and also is superior to zsmalloc
> with regard to average performance and worst execution times, thus
> allowing for better response time and real-time characteristics of the
> whole system.
> 
> E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
> 5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.

Can you explain a bit more on this test? How is this test using zblock?

thanks,
Shakeel
igor.b@beldev.am April 2, 2025, 6:45 a.m. UTC | #4
Hi Shakeel,

2025-04-02 03:16 Shakeel Butt wrote:
> Hi Vitaly,
> 
> On Tue, Apr 01, 2025 at 07:17:54PM +0200, Vitaly Wool wrote:
>> zblock is a special purpose allocator for storing compressed pages.
>> It stores integer number of compressed objects per its block. These
>> blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
>> 
>> With zblock, it is possible to densely arrange objects of various 
>> sizes
>> resulting in low internal fragmentation. Also this allocator tries to
>> fill incomplete blocks instead of adding new ones,  in many cases
>> providing a compression ratio substantially higher than z3fold and 
>> zbud
>> (though lower than zmalloc's).
>> 
>> zblock does not require MMU
> 
> Can you explain why not requiring MMU is important for your use-case?
> Also what exactly is your use-case? Are you planning to use zblock
> through zram or zswap or something new?

We have two use cases for a zpool backend: data centers (zswap) and small
MMU-less devices, which is where zram comes into play. We know that the
zram-over-zpool patch is still out of tree, but we also know the story
behind that :)
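
Since zblock only registers a zpool_driver, a consumer such as zswap reaches
it through the zpool API by backend name. A minimal sketch of such a round
trip, assuming the zpool interface this series builds against (the function
below is illustrative and not part of the patch):

#include <linux/zpool.h>
#include <linux/gfp.h>
#include <linux/string.h>

/* Store one compressed object in a zblock-backed zpool, then free it. */
static int zblock_roundtrip_example(const void *src, size_t len)
{
	struct zpool *zp;
	unsigned long handle;
	void *dst;
	int err;

	zp = zpool_create_pool("zblock", "example", GFP_KERNEL);
	if (!zp)
		return -ENOMEM;

	err = zpool_malloc(zp, len, GFP_KERNEL, &handle);	/* -> zblock_alloc() */
	if (err)
		goto out;

	dst = zpool_map_handle(zp, handle, ZPOOL_MM_WO);	/* -> zblock_map() */
	memcpy(dst, src, len);
	zpool_unmap_handle(zp, handle);				/* -> zblock_unmap() */

	zpool_free(zp, handle);					/* -> zblock_free() */
out:
	zpool_destroy_pool(zp);
	return err;
}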

>> to operate and also is superior to zsmalloc
>> with regard to average performance and worst execution times, thus
>> allowing for better response time and real-time characteristics of the
>> whole system.
>> 
>> E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
>> 5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
> 
> Can you explain a bit more on this test? How is this test using zblock?

We're running stress-ng on a number of devices. The results published in
the previous message are from the run on a Raspberry Pi 5. The zpool
settings used for comparison are lz4/80%.

Thanks,
Igor
kernel test robot April 2, 2025, 1:03 p.m. UTC | #5
Hi Vitaly,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.14]
[cannot apply to akpm-mm/mm-everything next-20250402]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vitaly-Wool/mm-add-zblock-allocator/20250402-011953
base:   linus/master
patch link:    https://lore.kernel.org/r/20250401171754.2686501-1-vitaly.wool%40konsulko.se
patch subject: [PATCH] mm: add zblock allocator
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20250402/202504022017.kHAz8bGB-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250402/202504022017.kHAz8bGB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504022017.kHAz8bGB-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/zblock.c:56: warning: Function parameter or struct member 'free_slots' not described in 'zblock_block'
>> mm/zblock.c:56: warning: Function parameter or struct member 'slot_info' not described in 'zblock_block'
>> mm/zblock.c:56: warning: Function parameter or struct member 'cache_idx' not described in 'zblock_block'
>> mm/zblock.c:102: warning: Function parameter or struct member 'slot_size' not described in 'block_desc'
>> mm/zblock.c:102: warning: Function parameter or struct member 'slots_per_block' not described in 'block_desc'
>> mm/zblock.c:102: warning: Function parameter or struct member 'order' not described in 'block_desc'
>> mm/zblock.c:102: warning: Function parameter or struct member 'block_desc' not described in 'block_desc'
>> mm/zblock.c:114: warning: Function parameter or struct member 'lock' not described in 'block_list'
>> mm/zblock.c:114: warning: Function parameter or struct member 'block_cache' not described in 'block_list'
>> mm/zblock.c:114: warning: Function parameter or struct member 'block_count' not described in 'block_list'
>> mm/zblock.c:249: warning: Excess function parameter 'ops' description in 'zblock_create_pool'


vim +56 mm/zblock.c

    42	
    43	/**
    44	 * struct zblock_block - block metadata
    45	 * Block consists of several (1/2/4/8) pages and contains fixed
    46	 * integer number of slots for allocating compressed pages.
    47	 *
    48	 * free_slots:	number of free slots in the block
    49	 * slot_info:	contains data about free/occupied slots
    50	 * cache_idx:	index of the block in cache
    51	 */
    52	struct zblock_block {
    53		atomic_t free_slots;
    54		u64 slot_info[1];
    55		int cache_idx;
  > 56	};
    57	
    58	/**
    59	 * struct block_desc - general metadata for block lists
    60	 * Each block list stores only blocks of corresponding type which means
    61	 * that all blocks in it have the same number and size of slots.
    62	 * All slots are aligned to size of long.
    63	 *
    64	 * slot_size:		size of slot for this list
    65	 * slots_per_block:	number of slots per block for this list
    66	 * order:		order for __get_free_pages
    67	 */
    68	static const struct block_desc {
    69		const unsigned int slot_size;
    70		const unsigned short slots_per_block;
    71		const unsigned short order;
    72	} block_desc[] = {
    73		{ SLOT_SIZE(32, 0), 32, 0 },
    74		{ SLOT_SIZE(22, 0), 22, 0 },
    75		{ SLOT_SIZE(17, 0), 17, 0 },
    76		{ SLOT_SIZE(13, 0), 13, 0 },
    77		{ SLOT_SIZE(11, 0), 11, 0 },
    78		{ SLOT_SIZE(9, 0), 9, 0 },
    79		{ SLOT_SIZE(8, 0), 8, 0 },
    80		{ SLOT_SIZE(14, 1), 14, 1 },
    81		{ SLOT_SIZE(12, 1), 12, 1 },
    82		{ SLOT_SIZE(11, 1), 11, 1 },
    83		{ SLOT_SIZE(10, 1), 10, 1 },
    84		{ SLOT_SIZE(9, 1), 9, 1 },
    85		{ SLOT_SIZE(8, 1), 8, 1 },
    86		{ SLOT_SIZE(15, 2), 15, 2 },
    87		{ SLOT_SIZE(14, 2), 14, 2 },
    88		{ SLOT_SIZE(13, 2), 13, 2 },
    89		{ SLOT_SIZE(12, 2), 12, 2 },
    90		{ SLOT_SIZE(11, 2), 11, 2 },
    91		{ SLOT_SIZE(10, 2), 10, 2 },
    92		{ SLOT_SIZE(9, 2), 9, 2 },
    93		{ SLOT_SIZE(8, 2), 8, 2 },
    94		{ SLOT_SIZE(15, 3), 15, 3 },
    95		{ SLOT_SIZE(14, 3), 14, 3 },
    96		{ SLOT_SIZE(13, 3), 13, 3 },
    97		{ SLOT_SIZE(12, 3), 12, 3 },
    98		{ SLOT_SIZE(11, 3), 11, 3 },
    99		{ SLOT_SIZE(10, 3), 10, 3 },
   100		{ SLOT_SIZE(9, 3), 9, 3 },
   101		{ SLOT_SIZE(7, 3), 7, 3 }
 > 102	};
   103	
   104	/**
   105	 * struct block_list - stores metadata of particular list
   106	 * lock:		protects block_cache
   107	 * block_cache:		blocks with free slots
   108	 * block_count:		total number of blocks in the list
   109	 */
   110	struct block_list {
   111		spinlock_t lock;
   112		struct zblock_block *block_cache[BLOCK_CACHE_SIZE];
   113		unsigned long block_count;
 > 114	};
   115	
   116	/**
   117	 * struct zblock_pool - stores metadata for each zblock pool
   118	 * @block_lists:	array of block lists
   119	 * @zpool:		zpool driver
   120	 * @alloc_flag:		protects block allocation from memory leak
   121	 *
   122	 * This structure is allocated at pool creation time and maintains metadata
   123	 * for a particular zblock pool.
   124	 */
   125	struct zblock_pool {
   126		struct block_list block_lists[ARRAY_SIZE(block_desc)];
   127		struct zpool *zpool;
   128		atomic_t alloc_flag;
   129	};
   130	
   131	/*****************
   132	 * Helpers
   133	 *****************/
   134	
   135	static int cache_insert_block(struct zblock_block *block, struct block_list *list)
   136	{
   137		unsigned int i, min_free_slots = atomic_read(&block->free_slots);
   138		int min_index = -1;
   139	
   140		if (WARN_ON(block->cache_idx != -1))
   141			return -EINVAL;
   142	
   143		min_free_slots = atomic_read(&block->free_slots);
   144		for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
   145			if (!list->block_cache[i] || !atomic_read(&(list->block_cache[i])->free_slots)) {
   146				min_index = i;
   147				break;
   148			}
   149			if (atomic_read(&(list->block_cache[i])->free_slots) < min_free_slots) {
   150				min_free_slots = atomic_read(&(list->block_cache[i])->free_slots);
   151				min_index = i;
   152			}
   153		}
   154		if (min_index >= 0) {
   155			if (list->block_cache[min_index])
   156				(list->block_cache[min_index])->cache_idx = -1;
   157			list->block_cache[min_index] = block;
   158			block->cache_idx = min_index;
   159		}
   160		return min_index < 0 ? min_index : 0;
   161	}
   162	
   163	static struct zblock_block *cache_find_block(struct block_list *list)
   164	{
   165		int i;
   166		struct zblock_block *z = NULL;
   167	
   168		for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
   169			if (list->block_cache[i] &&
   170			    atomic_dec_if_positive(&list->block_cache[i]->free_slots) >= 0) {
   171				z = list->block_cache[i];
   172				break;
   173			}
   174		}
   175		return z;
   176	}
   177	
   178	static int cache_remove_block(struct block_list *list, struct zblock_block *block)
   179	{
   180		int idx = block->cache_idx;
   181	
   182		block->cache_idx = -1;
   183		if (idx >= 0)
   184			list->block_cache[idx] = NULL;
   185		return idx < 0 ? idx : 0;
   186	}
   187	
   188	/*
   189	 * Encodes the handle of a particular slot in the pool using metadata
   190	 */
   191	static inline unsigned long metadata_to_handle(struct zblock_block *block,
   192								unsigned int block_type, unsigned int slot)
   193	{
   194		return (unsigned long)(block) + (block_type << SLOT_BITS) + slot;
   195	}
   196	
   197	/* Returns block, block type and slot in the pool corresponding to handle */
   198	static inline struct zblock_block *handle_to_metadata(unsigned long handle,
   199							unsigned int *block_type, unsigned int *slot)
   200	{
   201		*block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
   202		*slot = handle & SLOT_MASK;
   203		return (struct zblock_block *)(handle & PAGE_MASK);
   204	}
   205	
   206	
   207	/*
   208	 * allocate new block and add it to corresponding block list
   209	 */
   210	static struct zblock_block *alloc_block(struct zblock_pool *pool,
   211						int block_type, gfp_t gfp,
   212						unsigned long *handle)
   213	{
   214		struct zblock_block *block;
   215		struct block_list *list;
   216	
   217		block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
   218		if (!block)
   219			return NULL;
   220	
   221		list = &(pool->block_lists)[block_type];
   222	
   223		/* init block data  */
   224		memset(&block->slot_info, 0, sizeof(block->slot_info));
   225		atomic_set(&block->free_slots, block_desc[block_type].slots_per_block - 1);
   226		block->cache_idx = -1;
   227		set_bit(BIT_SLOT_OCCUPIED, (unsigned long *)block->slot_info);
   228		*handle = metadata_to_handle(block, block_type, 0);
   229	
   230		spin_lock(&list->lock);
   231		cache_insert_block(block, list);
   232		list->block_count++;
   233		spin_unlock(&list->lock);
   234		return block;
   235	}
   236	
   237	/*****************
   238	 * API Functions
   239	 *****************/
   240	/**
   241	 * zblock_create_pool() - create a new zblock pool
   242	 * @gfp:	gfp flags when allocating the zblock pool structure
   243	 * @ops:	user-defined operations for the zblock pool
   244	 *
   245	 * Return: pointer to the new zblock pool or NULL if the metadata allocation
   246	 * failed.
   247	 */
   248	static struct zblock_pool *zblock_create_pool(gfp_t gfp)
 > 249	{
   250		struct zblock_pool *pool;
   251		struct block_list *list;
   252		int i, j;
   253	
   254		pool = kmalloc(sizeof(struct zblock_pool), gfp);
   255		if (!pool)
   256			return NULL;
   257	
   258		/* init each block list */
   259		for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
   260			list = &(pool->block_lists)[i];
   261			spin_lock_init(&list->lock);
   262			for (j = 0; j < BLOCK_CACHE_SIZE; j++)
   263				list->block_cache[j] = NULL;
   264			list->block_count = 0;
   265		}
   266		atomic_set(&pool->alloc_flag, 0);
   267		return pool;
   268	}
   269
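
The kernel-doc warnings above mostly stem from the struct comments describing
their members as "member:" rather than "@member:", plus a stale "@ops:" line
on zblock_create_pool(), which takes no such parameter. A sketch of the form
kernel-doc expects, using struct zblock_block as an example:

/**
 * struct zblock_block - block metadata
 * @free_slots:	number of free slots in the block
 * @slot_info:	per-slot free/occupied bits
 * @cache_idx:	index of the block in the per-list cache, or -1 if not cached
 *
 * A block consists of several (1/2/4/8) pages and contains a fixed
 * integer number of slots for allocating compressed pages.
 */
struct zblock_block {
	atomic_t free_slots;
	u64 slot_info[1];
	int cache_idx;
};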
Shakeel Butt April 2, 2025, 4:24 p.m. UTC | #6
Hi Igor,

On Wed, Apr 02, 2025 at 10:45:25AM +0400, igor.b@beldev.am wrote:
> Hi Shakeel,
> 
> 2025-04-02 03:16 Shakeel Butt wrote:
> > Hi Vitaly,
> > 
> > On Tue, Apr 01, 2025 at 07:17:54PM +0200, Vitaly Wool wrote:
> > > zblock is a special purpose allocator for storing compressed pages.
> > > It stores integer number of compressed objects per its block. These
> > > blocks consist of several physical pages (2**n, i. e. 1/2/4/8).
> > > 
> > > With zblock, it is possible to densely arrange objects of various
> > > sizes
> > > resulting in low internal fragmentation. Also this allocator tries to
> > > fill incomplete blocks instead of adding new ones,  in many cases
> > > providing a compression ratio substantially higher than z3fold and
> > > zbud
> > > (though lower than zmalloc's).
> > > 
> > > zblock does not require MMU
> > 
> > Can you explain why not requiring MMU is important for your use-case?
> > Also what exactly is your use-case? Are you planning to use zblock
> > through zram or zswap or something new?
> 
> We have 2 use cases for a zpool backend: data centers (zswap) and small MMU
> less devices, this is where zram comes ito play. We know that zram over
> zpool patch is still out of tree but we also know the story behind that :)
>

So zswap and zram are the use cases. It is usually frowned upon to add a
general-purpose library without an actual user, so I would recommend
adding a user along with zblock.

> > > to operate and also is superior to zsmalloc
> > > with regard to average performance and worst execution times, thus
> > > allowing for better response time and real-time characteristics of the
> > > whole system.
> > > 
> > > E. g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
> > > 5-10% higher value for bogo ops/s in zblock/zsmalloc comparison.
> > 
> > Can you explain a bit more on this test? How is this test using zblock?
> 
> We're running stress-ng on a number of devices. The results published in the
> previous message are from the run on Raspberry Pi 5. zpool's settings used
> for comparison are lz4/80%.
> 

My question was mainly whether this is zswap or zram, and I think it is
zswap based on the stress-ng command. For zswap, I would recommend running
a perf experiment on a server (or a beefy desktop) as well, and picking a
normal server-like workload. I would be interested in looking at zswapin
latencies and zswapout throughput in addition to the metrics the workload
cares about.

> Thanks,
> Igor
>

Patch

diff --git a/Documentation/mm/zblock.rst b/Documentation/mm/zblock.rst
new file mode 100644
index 000000000000..754b3dbb9e94
--- /dev/null
+++ b/Documentation/mm/zblock.rst
@@ -0,0 +1,22 @@ 
+======
+zblock
+======
+
+zblock is a special-purpose allocator for storing compressed pages.
+It stores an integer number of compressed objects per block. These
+blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
+
+With zblock, it is possible to densely arrange objects of various sizes
+resulting in low internal fragmentation. Also, this allocator tries to
+fill incomplete blocks instead of adding new ones, in many cases
+providing a compression ratio substantially higher than z3fold and zbud
+(though lower than zsmalloc's).
+
+zblock does not require an MMU to operate and is also superior to zsmalloc
+with regard to average performance and worst-case execution times, thus
+allowing for better response times and real-time characteristics of the
+whole system.
+
+E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get
+5-10% higher bogo ops/s values in a zblock/zsmalloc comparison.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 991a33bad10e..166e9bfa04dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26313,6 +26313,13 @@  F:	Documentation/networking/device_drivers/hamradio/z8530drv.rst
 F:	drivers/net/hamradio/*scc.c
 F:	drivers/net/hamradio/z8530.h
 
+ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
+M:	Vitaly Wool <vitaly.wool@konsulko.se>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	Documentation/mm/zblock.rst
+F:	mm/zblock.c
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M:	Seth Jennings <sjenning@redhat.com>
 M:	Dan Streetman <ddstreet@ieee.org>
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..26b79e3c1300 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,6 +193,14 @@  config Z3FOLD_DEPRECATED
 	  page. It is a ZBUD derivative so the simplicity and determinism are
 	  still there.
 
+config ZBLOCK
+	tristate "Fast compression allocator with high density"
+	depends on ZPOOL
+	help
+	  A special purpose allocator for storing compressed pages.
+	  It is designed to store same size compressed pages in blocks of
+	  physical pages.
+
 config Z3FOLD
 	tristate
 	default y if Z3FOLD_DEPRECATED=y
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..2018455b7baa 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -116,6 +116,7 @@  obj-$(CONFIG_ZPOOL)	+= zpool.o
 obj-$(CONFIG_ZBUD)	+= zbud.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_Z3FOLD)	+= z3fold.o
+obj-$(CONFIG_ZBLOCK)	+= zblock.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_NUMA) += numa.o
diff --git a/mm/zblock.c b/mm/zblock.c
new file mode 100644
index 000000000000..a6778653c451
--- /dev/null
+++ b/mm/zblock.c
@@ -0,0 +1,492 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * zblock.c
+ *
+ * Author: Vitaly Wool <vitaly.wool@konsulko.com>
+ * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
+ * Copyright (C) 2022-2024, Konsulko AB.
+ *
+ * Zblock is a small object allocator intended to serve as a zpool
+ * backend. It operates on page blocks which consist of a power-of-2
+ * number of physical pages and store an integer number of compressed
+ * pages per block, which results in determinism and simplicity.
+ *
+ * zblock doesn't export any API and is meant to be used via zpool API.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/preempt.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/zpool.h>
+
+#define SLOT_FREE 0
+#define BIT_SLOT_OCCUPIED 0
+#define BIT_SLOT_MAPPED 1
+
+#define SLOT_BITS 5
+#define MAX_SLOTS (1 << SLOT_BITS)
+#define SLOT_MASK ((0x1UL << SLOT_BITS) - 1)
+
+#define ZBLOCK_HEADER_SIZE	round_up(sizeof(struct zblock_block), sizeof(long))
+#define BLOCK_DATA_SIZE(order) ((PAGE_SIZE << order) - ZBLOCK_HEADER_SIZE)
+#define SLOT_SIZE(nslots, order) (round_down((BLOCK_DATA_SIZE(order) / nslots), sizeof(long)))
+
+#define BLOCK_CACHE_SIZE 32
+
+struct zblock_pool;
+
+/**
+ * struct zblock_block - block metadata
+ * Block consists of several (1/2/4/8) pages and contains fixed
+ * integer number of slots for allocating compressed pages.
+ *
+ * free_slots:	number of free slots in the block
+ * slot_info:	contains data about free/occupied slots
+ * cache_idx:	index of the block in cache
+ */
+struct zblock_block {
+	atomic_t free_slots;
+	u64 slot_info[1];
+	int cache_idx;
+};
+
+/**
+ * struct block_desc - general metadata for block lists
+ * Each block list stores only blocks of corresponding type which means
+ * that all blocks in it have the same number and size of slots.
+ * All slots are aligned to size of long.
+ *
+ * slot_size:		size of slot for this list
+ * slots_per_block:	number of slots per block for this list
+ * order:		order for __get_free_pages
+ */
+static const struct block_desc {
+	const unsigned int slot_size;
+	const unsigned short slots_per_block;
+	const unsigned short order;
+} block_desc[] = {
+	{ SLOT_SIZE(32, 0), 32, 0 },
+	{ SLOT_SIZE(22, 0), 22, 0 },
+	{ SLOT_SIZE(17, 0), 17, 0 },
+	{ SLOT_SIZE(13, 0), 13, 0 },
+	{ SLOT_SIZE(11, 0), 11, 0 },
+	{ SLOT_SIZE(9, 0), 9, 0 },
+	{ SLOT_SIZE(8, 0), 8, 0 },
+	{ SLOT_SIZE(14, 1), 14, 1 },
+	{ SLOT_SIZE(12, 1), 12, 1 },
+	{ SLOT_SIZE(11, 1), 11, 1 },
+	{ SLOT_SIZE(10, 1), 10, 1 },
+	{ SLOT_SIZE(9, 1), 9, 1 },
+	{ SLOT_SIZE(8, 1), 8, 1 },
+	{ SLOT_SIZE(15, 2), 15, 2 },
+	{ SLOT_SIZE(14, 2), 14, 2 },
+	{ SLOT_SIZE(13, 2), 13, 2 },
+	{ SLOT_SIZE(12, 2), 12, 2 },
+	{ SLOT_SIZE(11, 2), 11, 2 },
+	{ SLOT_SIZE(10, 2), 10, 2 },
+	{ SLOT_SIZE(9, 2), 9, 2 },
+	{ SLOT_SIZE(8, 2), 8, 2 },
+	{ SLOT_SIZE(15, 3), 15, 3 },
+	{ SLOT_SIZE(14, 3), 14, 3 },
+	{ SLOT_SIZE(13, 3), 13, 3 },
+	{ SLOT_SIZE(12, 3), 12, 3 },
+	{ SLOT_SIZE(11, 3), 11, 3 },
+	{ SLOT_SIZE(10, 3), 10, 3 },
+	{ SLOT_SIZE(9, 3), 9, 3 },
+	{ SLOT_SIZE(7, 3), 7, 3 }
+};
+
+/**
+ * struct block_list - stores metadata of particular list
+ * lock:		protects block_cache
+ * block_cache:		blocks with free slots
+ * block_count:		total number of blocks in the list
+ */
+struct block_list {
+	spinlock_t lock;
+	struct zblock_block *block_cache[BLOCK_CACHE_SIZE];
+	unsigned long block_count;
+};
+
+/**
+ * struct zblock_pool - stores metadata for each zblock pool
+ * @block_lists:	array of block lists
+ * @zpool:		zpool driver
+ * @alloc_flag:		protects block allocation from memory leak
+ *
+ * This structure is allocated at pool creation time and maintains metadata
+ * for a particular zblock pool.
+ */
+struct zblock_pool {
+	struct block_list block_lists[ARRAY_SIZE(block_desc)];
+	struct zpool *zpool;
+	atomic_t alloc_flag;
+};
+
+/*****************
+ * Helpers
+ *****************/
+
+static int cache_insert_block(struct zblock_block *block, struct block_list *list)
+{
+	unsigned int i, min_free_slots = atomic_read(&block->free_slots);
+	int min_index = -1;
+
+	if (WARN_ON(block->cache_idx != -1))
+		return -EINVAL;
+
+	min_free_slots = atomic_read(&block->free_slots);
+	for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
+		if (!list->block_cache[i] || !atomic_read(&(list->block_cache[i])->free_slots)) {
+			min_index = i;
+			break;
+		}
+		if (atomic_read(&(list->block_cache[i])->free_slots) < min_free_slots) {
+			min_free_slots = atomic_read(&(list->block_cache[i])->free_slots);
+			min_index = i;
+		}
+	}
+	if (min_index >= 0) {
+		if (list->block_cache[min_index])
+			(list->block_cache[min_index])->cache_idx = -1;
+		list->block_cache[min_index] = block;
+		block->cache_idx = min_index;
+	}
+	return min_index < 0 ? min_index : 0;
+}
+
+static struct zblock_block *cache_find_block(struct block_list *list)
+{
+	int i;
+	struct zblock_block *z = NULL;
+
+	for (i = 0; i < BLOCK_CACHE_SIZE; i++) {
+		if (list->block_cache[i] &&
+		    atomic_dec_if_positive(&list->block_cache[i]->free_slots) >= 0) {
+			z = list->block_cache[i];
+			break;
+		}
+	}
+	return z;
+}
+
+static int cache_remove_block(struct block_list *list, struct zblock_block *block)
+{
+	int idx = block->cache_idx;
+
+	block->cache_idx = -1;
+	if (idx >= 0)
+		list->block_cache[idx] = NULL;
+	return idx < 0 ? idx : 0;
+}
+
+/*
+ * Encodes the handle of a particular slot in the pool using metadata
+ */
+static inline unsigned long metadata_to_handle(struct zblock_block *block,
+							unsigned int block_type, unsigned int slot)
+{
+	return (unsigned long)(block) + (block_type << SLOT_BITS) + slot;
+}
+
+/* Returns block, block type and slot in the pool corresponding to handle */
+static inline struct zblock_block *handle_to_metadata(unsigned long handle,
+						unsigned int *block_type, unsigned int *slot)
+{
+	*block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
+	*slot = handle & SLOT_MASK;
+	return (struct zblock_block *)(handle & PAGE_MASK);
+}
+
+
+/*
+ * allocate new block and add it to corresponding block list
+ */
+static struct zblock_block *alloc_block(struct zblock_pool *pool,
+					int block_type, gfp_t gfp,
+					unsigned long *handle)
+{
+	struct zblock_block *block;
+	struct block_list *list;
+
+	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
+	if (!block)
+		return NULL;
+
+	list = &(pool->block_lists)[block_type];
+
+	/* init block data  */
+	memset(&block->slot_info, 0, sizeof(block->slot_info));
+	atomic_set(&block->free_slots, block_desc[block_type].slots_per_block - 1);
+	block->cache_idx = -1;
+	set_bit(BIT_SLOT_OCCUPIED, (unsigned long *)block->slot_info);
+	*handle = metadata_to_handle(block, block_type, 0);
+
+	spin_lock(&list->lock);
+	cache_insert_block(block, list);
+	list->block_count++;
+	spin_unlock(&list->lock);
+	return block;
+}
+
+/*****************
+ * API Functions
+ *****************/
+/**
+ * zblock_create_pool() - create a new zblock pool
+ * @gfp:	gfp flags when allocating the zblock pool structure
+ * @ops:	user-defined operations for the zblock pool
+ *
+ * Return: pointer to the new zblock pool or NULL if the metadata allocation
+ * failed.
+ */
+static struct zblock_pool *zblock_create_pool(gfp_t gfp)
+{
+	struct zblock_pool *pool;
+	struct block_list *list;
+	int i, j;
+
+	pool = kmalloc(sizeof(struct zblock_pool), gfp);
+	if (!pool)
+		return NULL;
+
+	/* init each block list */
+	for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
+		list = &(pool->block_lists)[i];
+		spin_lock_init(&list->lock);
+		for (j = 0; j < BLOCK_CACHE_SIZE; j++)
+			list->block_cache[j] = NULL;
+		list->block_count = 0;
+	}
+	atomic_set(&pool->alloc_flag, 0);
+	return pool;
+}
+
+/**
+ * zblock_destroy_pool() - destroys an existing zblock pool
+ * @pool:	the zblock pool to be destroyed
+ *
+ */
+static void zblock_destroy_pool(struct zblock_pool *pool)
+{
+	kfree(pool);
+}
+
+
+/**
+ * zblock_alloc() - allocates a slot of appropriate size
+ * @pool:	zblock pool from which to allocate
+ * @size:	size in bytes of the desired allocation
+ * @gfp:	gfp flags used if the pool needs to grow
+ * @handle:	handle of the new allocation
+ *
+ * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
+ * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
+ * a new slot.
+ */
+static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
+			unsigned long *handle)
+{
+	unsigned int block_type, slot;
+	struct zblock_block *block;
+	struct block_list *list;
+
+	if (!size)
+		return -EINVAL;
+
+	if (size > PAGE_SIZE)
+		return -ENOSPC;
+
+	/* find basic block type with suitable slot size */
+	for (block_type = 0; block_type < ARRAY_SIZE(block_desc); block_type++) {
+		if (size <= block_desc[block_type].slot_size)
+			break;
+	}
+	list = &(pool->block_lists[block_type]);
+
+check:
+	/* check if there are free slots in cache */
+	spin_lock(&list->lock);
+	block = cache_find_block(list);
+	spin_unlock(&list->lock);
+	if (block)
+		goto found;
+
+	/* not found block with free slots try to allocate new empty block */
+	if (atomic_cmpxchg(&pool->alloc_flag, 0, 1))
+		goto check;
+	block = alloc_block(pool, block_type, gfp, handle);
+	atomic_set(&pool->alloc_flag, 0);
+	if (block)
+		return 0;
+	return -ENOMEM;
+
+found:
+	/* find the first free slot in block */
+	for (slot = 0; slot < block_desc[block_type].slots_per_block; slot++) {
+		if (!test_and_set_bit(slot*2 + BIT_SLOT_OCCUPIED,
+				     (unsigned long *)&block->slot_info))
+			break;
+	}
+	*handle = metadata_to_handle(block, block_type, slot);
+	return 0;
+}
+
+/**
+ * zblock_free() - frees the allocation associated with the given handle
+ * @pool:	pool in which the allocation resided
+ * @handle:	handle associated with the allocation returned by zblock_alloc()
+ *
+ */
+static void zblock_free(struct zblock_pool *pool, unsigned long handle)
+{
+	unsigned int slot, block_type;
+	struct zblock_block *block;
+	struct block_list *list;
+
+	block = handle_to_metadata(handle, &block_type, &slot);
+	list = &(pool->block_lists[block_type]);
+
+	spin_lock(&list->lock);
+	/* if all slots in block are empty delete whole block */
+	if (atomic_inc_return(&block->free_slots) == block_desc[block_type].slots_per_block) {
+		list->block_count--;
+		cache_remove_block(list, block);
+		spin_unlock(&list->lock);
+		free_pages((unsigned long)block, block_desc[block_type].order);
+		return;
+	}
+
+	if (atomic_read(&block->free_slots) < block_desc[block_type].slots_per_block/2
+			&& block->cache_idx == -1)
+		cache_insert_block(block, list);
+	spin_unlock(&list->lock);
+
+	clear_bit(slot*2 + BIT_SLOT_OCCUPIED, (unsigned long *)block->slot_info);
+}
+
+/**
+ * zblock_map() - maps the allocation associated with the given handle
+ * @pool:	pool in which the allocation resides
+ * @handle:	handle associated with the allocation to be mapped
+ *
+ *
+ * Returns: a pointer to the mapped allocation
+ */
+static void *zblock_map(struct zblock_pool *pool, unsigned long handle)
+{
+	unsigned int block_type, slot;
+	struct zblock_block *block;
+	void *p;
+
+	block = handle_to_metadata(handle, &block_type, &slot);
+	p = (void *)block + ZBLOCK_HEADER_SIZE + slot * block_desc[block_type].slot_size;
+	return p;
+}
+
+/**
+ * zblock_unmap() - unmaps the allocation associated with the given handle
+ * @pool:	pool in which the allocation resides
+ * @handle:	handle associated with the allocation to be unmapped
+ */
+static void zblock_unmap(struct zblock_pool *pool, unsigned long handle)
+{
+}
+
+/**
+ * zblock_get_total_pages() - gets the zblock pool size in pages
+ * @pool:	pool being queried
+ *
+ * Returns: size in pages of the given pool.
+ */
+static u64 zblock_get_total_pages(struct zblock_pool *pool)
+{
+	u64 total_size;
+	int i;
+
+	total_size = 0;
+	for (i = 0; i < ARRAY_SIZE(block_desc); i++)
+		total_size += pool->block_lists[i].block_count << block_desc[i].order;
+
+	return total_size;
+}
+
+/*****************
+ * zpool
+ ****************/
+
+static void *zblock_zpool_create(const char *name, gfp_t gfp)
+{
+	return zblock_create_pool(gfp);
+}
+
+static void zblock_zpool_destroy(void *pool)
+{
+	zblock_destroy_pool(pool);
+}
+
+static int zblock_zpool_malloc(void *pool, size_t size, gfp_t gfp,
+			unsigned long *handle)
+{
+	return zblock_alloc(pool, size, gfp, handle);
+}
+
+static void zblock_zpool_free(void *pool, unsigned long handle)
+{
+	zblock_free(pool, handle);
+}
+
+static void *zblock_zpool_map(void *pool, unsigned long handle,
+			enum zpool_mapmode mm)
+{
+	return zblock_map(pool, handle);
+}
+
+static void zblock_zpool_unmap(void *pool, unsigned long handle)
+{
+	zblock_unmap(pool, handle);
+}
+
+static u64 zblock_zpool_total_pages(void *pool)
+{
+	return zblock_get_total_pages(pool);
+}
+
+static struct zpool_driver zblock_zpool_driver = {
+	.type =		"zblock",
+	.owner =	THIS_MODULE,
+	.create =	zblock_zpool_create,
+	.destroy =	zblock_zpool_destroy,
+	.malloc =	zblock_zpool_malloc,
+	.free =		zblock_zpool_free,
+	.map =		zblock_zpool_map,
+	.unmap =	zblock_zpool_unmap,
+	.total_pages =	zblock_zpool_total_pages,
+};
+
+MODULE_ALIAS("zpool-zblock");
+
+static int __init init_zblock(void)
+{
+	pr_info("loaded\n");
+	zpool_register_driver(&zblock_zpool_driver);
+	return 0;
+}
+
+static void __exit exit_zblock(void)
+{
+	zpool_unregister_driver(&zblock_zpool_driver);
+	pr_info("unloaded\n");
+}
+
+module_init(init_zblock);
+module_exit(exit_zblock);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vitaly Wool <vitaly.wool@konsulko.com>");
+MODULE_DESCRIPTION("Block allocator for compressed pages");