[v1,04/18] util/rbcache: range-based cache core

Message ID 20161115063715.12561-5-pbutsykin@virtuozzo.com (mailing list archive)
State New, archived

Commit Message

Pavel Butsykin Nov. 15, 2016, 6:37 a.m. UTC
RBCache provides functionality for caching data, primarily from block
devices. A range is used as the main key for searching and storing data.
The cache is based on red-black trees, so the basic operations (search,
insert, delete) are performed in O(log n).

It is important to note that QEMU usually does not require a data cache,
but in practice there are already cases where caching small amounts of data
can increase performance. Red-black trees were chosen because they are a
fairly simple data structure that shows high efficiency on a small number
of elements. Therefore, with a minimum range of 512 bytes, the recommended
cache size is no more than 8-16 MB. Also note that this cache
implementation allows storing ranges of different lengths without
alignment.

The generic cache core can easily be used to implement different caching
policies at the block level, such as read-ahead. It can also be used in
some special cases, for example caching data in qcow2 during sequential
allocating writes to an image with a backing file.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 MAINTAINERS            |   6 ++
 include/qemu/rbcache.h | 105 +++++++++++++++++++++
 util/Makefile.objs     |   1 +
 util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 358 insertions(+)
 create mode 100644 include/qemu/rbcache.h
 create mode 100644 util/rbcache.c

Comments

Kevin Wolf Nov. 23, 2016, 9:25 p.m. UTC | #1
Am 15.11.2016 um 07:37 hat Pavel Butsykin geschrieben:
> RBCache provides functionality for caching data, primarily from block
> devices. A range is used as the main key for searching and storing data.
> The cache is based on red-black trees, so the basic operations (search,
> insert, delete) are performed in O(log n).
> 
> It is important to note that QEMU usually does not require a data cache,
> but in practice there are already cases where caching small amounts of data
> can increase performance. Red-black trees were chosen because they are a
> fairly simple data structure that shows high efficiency on a small number
> of elements. Therefore, with a minimum range of 512 bytes, the recommended
> cache size is no more than 8-16 MB. Also note that this cache
> implementation allows storing ranges of different lengths without
> alignment.
> 
> The generic cache core can easily be used to implement different caching
> policies at the block level, such as read-ahead. It can also be used in
> some special cases, for example caching data in qcow2 during sequential
> allocating writes to an image with a backing file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  MAINTAINERS            |   6 ++
>  include/qemu/rbcache.h | 105 +++++++++++++++++++++
>  util/Makefile.objs     |   1 +
>  util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 358 insertions(+)
>  create mode 100644 include/qemu/rbcache.h
>  create mode 100644 util/rbcache.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ddf797b..cb74802 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1365,6 +1365,12 @@ F: include/qemu/rbtree.h
>  F: include/qemu/rbtree_augmented.h
>  F: util/rbtree.c
>  
> +Range-Based Cache
> +M: Denis V. Lunev <den@openvz.org>
> +S: Supported
> +F: include/qemu/rbcache.h
> +F: util/rbcache.c
> +
>  UUID
>  M: Fam Zheng <famz@redhat.com>
>  S: Supported
> diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
> new file mode 100644
> index 0000000..c8f0a9f
> --- /dev/null
> +++ b/include/qemu/rbcache.h
> @@ -0,0 +1,105 @@
> +/*
> + * QEMU Range-Based Cache core
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef RBCACHE_H
> +#define RBCACHE_H
> +
> +#include "qemu/rbtree.h"
> +#include "qemu/queue.h"
> +
> +typedef struct RBCacheNode {
> +    struct RbNode rb_node;
> +    uint64_t offset;
> +    uint64_t bytes;
> +    QTAILQ_ENTRY(RBCacheNode) entry;
> +} RBCacheNode;
> +
> +typedef struct RBCache RBCache;
> +
> +typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
> +typedef void RBNodeFree(RBCacheNode *node, void *opaque);

Maybe worth comments describing what these functions do apart from
g_new()/g_free()? I assume that offset and bytes must be initialised
from the parameters. Should rb_node and entry be zeroed?

> +
> +enum eviction_type {
> +    RBCACHE_FIFO,
> +    RBCACHE_LRU,
> +};
> +
> +/**
> + * rbcache_search:
> + * @rbcache: the cache object.
> + * @offset: the start of the range.
> + * @bytes: the size of the range.
> + *
> + * Returns the node corresponding to the range (offset, bytes),
> + * or NULL if no node was found.
> + */
> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);

What if the range covers multiple nodes? Is it defined which of the
nodes we return or do you just get any?

Why does this function (and the following ones) return void* rather than
RBCacheNode* if they are supposed to return a node?

> +/**
> + * rbcache_insert:
> + * @rbcache: the cache object.
> + * @node: a new node for the cache.
> + *
> + * Returns the new node, or old node if the node already exists.
> + */
> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);

What does "if the node already exists" mean? If @node (the very same
object) is already stored in the cache object, or if a node describing
the same range already exists?

> +/**
> + * rbcache_search_and_insert:
> + * @rbcache: the cache object.
> + * @offset: the start of the range.
> + * @bytes: the size of the range.
> + *
> + * rbcache_search_and_insert() is like rbcache_insert(), except that a new node
> + * is allocated inside the function. Returns the new node, or old node if the
> + * node already exists.
> + */
> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes);

What happens if a node exists, but only for part of the range?

> +/**
> + * rbcache_remove:
> + * @rbcache: the cache object.
> + * @node: the node to remove.
> + *
> + * Removes the cached range owned by the node.
> + */
> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
> +
> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes);
> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);

No comment for these two?

> +
> +/**
> + * rbcache_create:
> + * @alloc: callback for node allocation; allows embedding the node in a
> + *         larger, user-defined structure. May be NULL.
> + * @free: callback for node release; must be provided together with @alloc.
> + * @limit_size: maximum cache size in bytes.
> + * @eviction_type: the eviction policy (RBCACHE_FIFO or RBCACHE_LRU).
> + * @opaque: the opaque pointer to pass to the callbacks.
> + *
> + * Returns the cache object.
> + */
> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
> +                        uint64_t limit_size, int eviction_type, void *opaque);
> +
> +/**
> + * rbcache_destroy:
> + * @rbcache: the cache object.
> + *
> + * Cleanup the cache object created with rbcache_create().
> + */
> +void rbcache_destroy(RBCache *rbcache);
> +
> +#endif /* RBCACHE_H */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 727a567..9fb0de6 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -38,3 +38,4 @@ util-obj-y += qdist.o
>  util-obj-y += qht.o
>  util-obj-y += range.o
>  util-obj-y += rbtree.o
> +util-obj-y += rbcache.o
> diff --git a/util/rbcache.c b/util/rbcache.c
> new file mode 100644
> index 0000000..05cfa5a
> --- /dev/null
> +++ b/util/rbcache.c
> @@ -0,0 +1,246 @@
> +/*
> + * QEMU Range-Based Cache core
> + *
> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
> + *
> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/rbcache.h"
> +
> +/* RBCache provides functionality for caching data, primarily from block
> + * devices. A range is used as the main key for searching and storing data.
> + * The cache is based on red-black trees, so the basic operations (search,
> + * insert, delete) are performed in O(log n).
> + *
> + * It is important to note that QEMU usually does not require a data cache,
> + * but in practice there are already cases where caching small amounts of
> + * data can increase performance. Red-black trees were chosen because they
> + * are a fairly simple data structure that shows high efficiency on a small
> + * number of elements. Therefore, with a minimum range of 512 bytes, the
> + * recommended cache size is no more than 8-16 MB. Also note that this
> + * cache implementation allows storing ranges of different lengths without
> + * alignment.
> + */
> +
> +struct RBCache {
> +    struct RbRoot root;
> +    RBNodeAlloc *alloc;
> +    RBNodeFree  *free;
> +    uint64_t limit_size;
> +    uint64_t cur_size;
> +    int eviction_type;

s/int/enum eviction_type/

> +    void *opaque;
> +
> +    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
> +};
> +
> +static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
> +{
> +    assert(node1 != NULL);
> +    assert(node2 != NULL);
> +
> +    if (node1->offset >= node2->offset + node2->bytes) {
> +        return 1;
> +    }
> +    if (node1->offset + node1->bytes <= node2->offset) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static RBCacheNode *node_previous(RBCacheNode* node, uint64_t target_offset)

It would be nice to describe this function in a comment.

From what I understand, it goes backwards in the tree until the last
node whose range end is larger than target_offset is reached.

> +{
> +    while (node) {
> +        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
> +        RBCacheNode *prev_node;
> +        if (prev_rb_node == NULL) {
> +            break;
> +        }
> +        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
> +        if (prev_node->offset + prev_node->bytes <= target_offset) {
> +            break;
> +        }
> +        node = prev_node;
> +    }
> +
> +    assert(node != NULL);
> +
> +    return node;
> +}
> +
> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes)
> +{
> +    RBCacheNode *node;
> +
> +    if (rbcache->alloc) {
> +        node = rbcache->alloc(offset, bytes, rbcache->opaque);
> +    } else {
> +        node = g_malloc(sizeof(*node));

g_new() is preferred.

> +    }
> +
> +    node->offset = offset;
> +    node->bytes  = bytes;

Okay, so the callback doesn't have to set offset/bytes because it's
already set here, and it can be NULL. Both things to be mentioned in the
documentation.

> +    return node;
> +}
> +
> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
> +{
> +    if (rbcache->free) {
> +        rbcache->free(node, rbcache->opaque);
> +    } else {
> +        g_free(node);
> +    }
> +}
> +
> +static void rbcache_try_shrink(RBCache *rbcache)
> +{
> +    while (rbcache->cur_size > rbcache->limit_size) {
> +        RBCacheNode *node;
> +        assert(!QTAILQ_EMPTY(&rbcache->queue));
> +
> +        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
> +
> +        rbcache_remove(rbcache, node);
> +    }
> +}
> +
> +static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
> +{
> +    if (rbcache->eviction_type == RBCACHE_LRU) {
> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
> +    }
> +}
> +
> +static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)

Here as well there seem to be a few different cases and it would be good
to have a comment that specifies the behaviour for each of them. It
would also give me something to review against.

> +{
> +    struct RbNode **new, *parent = NULL;
> +
> +    assert(rbcache != NULL);
> +    assert(node->bytes != 0);
> +
> +    /* Figure out where to put new node */
> +    new = &(rbcache->root.rb_node);
> +    while (*new) {
> +        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
> +        int result = node_key_cmp(node, this);
> +        if (result == 0) {
> +            this = node_previous(this, node->offset);

So in case of partial overlaps, the existing overlapping node with the
lowest offset is returned, but the range is not adjusted. Correct?

> +            node_move_in_queue(rbcache, this);
> +            return this;
> +        }
> +        parent = *new;
> +        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
> +    }
> +
> +    if (alloc) {
> +        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
> +    }
> +    /* Add new node and rebalance tree. */
> +    rb_link_node(&node->rb_node, parent, new);
> +    rb_insert_color(&node->rb_node, &rbcache->root);
> +
> +    rbcache->cur_size += node->bytes;
> +
> +    rbcache_try_shrink(rbcache);
> +
> +    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
> +
> +    return node;
> +}
> +
> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
> +{
> +    struct RbNode *rb_node;
> +    RBCacheNode node = {
> +        .offset = offset,
> +        .bytes  = bytes,
> +    };
> +
> +    assert(rbcache != NULL);
> +
> +    rb_node = rbcache->root.rb_node;
> +    while (rb_node) {
> +        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
> +        int32_t result = node_key_cmp(&node, this);
> +        if (result == 0) {
> +            this = node_previous(this, offset);
> +            node_move_in_queue(rbcache, this);
> +            return this;
> +        }
> +        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
> +    }
> +    return NULL;
> +}
> +
> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
> +{
> +    return node_insert(rbcache, node, false);
> +}
> +
> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
> +                                uint64_t bytes)
> +{
> +    RBCacheNode node = {
> +        .offset = offset,
> +        .bytes  = bytes,
> +    };
> +
> +    return node_insert(rbcache, &node, true);
> +}
> +
> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
> +{
> +    assert(rbcache->cur_size >= node->bytes);
> +
> +    rbcache->cur_size -= node->bytes;
> +    rb_erase(&node->rb_node, &rbcache->root);
> +
> +    QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +
> +    rbcache_node_free(rbcache, node);
> +}
> +
> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
> +                        uint64_t limit_size, int eviction_type, void *opaque)
> +{
> +    RBCache *rbcache = g_malloc(sizeof(*rbcache));

Again, g_new() is preferred.

> +
> +    /* Either both callbacks must be provided, or neither */
> +    assert(!(!alloc ^ !free));
> +
> +    rbcache->root   = RB_ROOT;
> +    rbcache->alloc  = alloc;
> +    rbcache->free   = free;
> +    rbcache->opaque = opaque;
> +    rbcache->limit_size = limit_size;
> +    rbcache->cur_size   = 0;
> +    rbcache->eviction_type = eviction_type;
> +
> +    QTAILQ_INIT(&rbcache->queue);

And actually, it would be good to make sure that all fields are zeroed
even if the struct is extended. You can do that with g_new0(), or -
that's what I would do - write the initialisation like this:

    *rbcache = (RBCache) {
        .root       = RB_ROOT,
        .alloc      = alloc,
        ...
    };

If you align all values on a column (which is appreciated), please do so
consistently and align everything to the same column as the longest
field name (eviction_type).

> +
> +    return rbcache;
> +}
> +
> +void rbcache_destroy(RBCache *rbcache)
> +{
> +    RBCacheNode *node, *next;
> +
> +    assert(rbcache != NULL);
> +
> +    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
> +        rbcache_node_free(rbcache, node);
> +    }
> +
> +    g_free(rbcache);
> +}

Kevin
Pavel Butsykin Nov. 24, 2016, 7:23 p.m. UTC | #2
On 24.11.2016 00:25, Kevin Wolf wrote:
> Am 15.11.2016 um 07:37 hat Pavel Butsykin geschrieben:
>> RBCache provides functionality for caching data, primarily from block
>> devices. A range is used as the main key for searching and storing data.
>> The cache is based on red-black trees, so the basic operations (search,
>> insert, delete) are performed in O(log n).
>>
>> It is important to note that QEMU usually does not require a data cache,
>> but in practice there are already cases where caching small amounts of data
>> can increase performance. Red-black trees were chosen because they are a
>> fairly simple data structure that shows high efficiency on a small number
>> of elements. Therefore, with a minimum range of 512 bytes, the recommended
>> cache size is no more than 8-16 MB. Also note that this cache
>> implementation allows storing ranges of different lengths without
>> alignment.
>>
>> The generic cache core can easily be used to implement different caching
>> policies at the block level, such as read-ahead. It can also be used in
>> some special cases, for example caching data in qcow2 during sequential
>> allocating writes to an image with a backing file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   MAINTAINERS            |   6 ++
>>   include/qemu/rbcache.h | 105 +++++++++++++++++++++
>>   util/Makefile.objs     |   1 +
>>   util/rbcache.c         | 246 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 358 insertions(+)
>>   create mode 100644 include/qemu/rbcache.h
>>   create mode 100644 util/rbcache.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index ddf797b..cb74802 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1365,6 +1365,12 @@ F: include/qemu/rbtree.h
>>   F: include/qemu/rbtree_augmented.h
>>   F: util/rbtree.c
>>
>> +Range-Based Cache
>> +M: Denis V. Lunev <den@openvz.org>
>> +S: Supported
>> +F: include/qemu/rbcache.h
>> +F: util/rbcache.c
>> +
>>   UUID
>>   M: Fam Zheng <famz@redhat.com>
>>   S: Supported
>> diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
>> new file mode 100644
>> index 0000000..c8f0a9f
>> --- /dev/null
>> +++ b/include/qemu/rbcache.h
>> @@ -0,0 +1,105 @@
>> +/*
>> + * QEMU Range-Based Cache core
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef RBCACHE_H
>> +#define RBCACHE_H
>> +
>> +#include "qemu/rbtree.h"
>> +#include "qemu/queue.h"
>> +
>> +typedef struct RBCacheNode {
>> +    struct RbNode rb_node;
>> +    uint64_t offset;
>> +    uint64_t bytes;
>> +    QTAILQ_ENTRY(RBCacheNode) entry;
>> +} RBCacheNode;
>> +
>> +typedef struct RBCache RBCache;
>> +
>> +typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
>> +typedef void RBNodeFree(RBCacheNode *node, void *opaque);
>
> Maybe worth comments describing what these functions do apart from
> g_new()/g_free()? I assume that offset and bytes must be initialised
> from the parameters. Should rb_node and entry be zeroed?

This callback is called inside rbcache and serves to extend the common node
descriptor structure without modifying the core code. Offset and bytes are
passed as information about the node, which may not actually be needed, as
for example the offset in pcache_node_alloc(). rb_node and entry are similar
to private fields: they are initialized when the node is added to the rbtree
and are not needed at all outside. But maybe as an extra precaution we can
use g_new0().
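
To illustrate, here is a minimal sketch of such a callback pair (MyNode and
its data field are made up for the example):

    typedef struct MyNode {
        RBCacheNode common;   /* first member, so the pointers coincide */
        uint8_t *data;        /* extra, user-owned payload */
    } MyNode;

    static RBCacheNode *my_node_alloc(uint64_t offset, uint64_t bytes,
                                      void *opaque)
    {
        MyNode *node = g_new0(MyNode, 1); /* also zeroes rb_node and entry */

        node->data = g_malloc(bytes);
        /* offset and bytes don't need to be set here, rbcache_node_alloc()
         * fills them in after the callback returns */
        return &node->common;
    }

    static void my_node_free(RBCacheNode *rbnode, void *opaque)
    {
        MyNode *node = (MyNode *)rbnode;

        g_free(node->data);
        g_free(node);
    }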

>> +
>> +enum eviction_type {
>> +    RBCACHE_FIFO,
>> +    RBCACHE_LRU,
>> +};
>> +
>> +/**
>> + * rbcache_search:
>> + * @rbcache: the cache object.
>> + * @offset: the start of the range.
>> + * @bytes: the size of the range.
>> + *
>> + * Returns the node corresponding to the range (offset, bytes),
>> + * or NULL if no node was found.
>> + */
>> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);
>
> What if the range covers multiple nodes? Is it defined which of the
> nodes we return or do you just get any?

Yes, it's defined: the node with the lowest offset is returned.
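
For example:

    /* The cache holds the nodes [0,512) and [512,1024).
     * rbcache_search(rbcache, 256, 512) overlaps both of them;
     * the one with the lowest offset, [0,512), is returned. */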

> Why does this function (and the following ones) return void* rather than
> RBCacheNode* if they are supposed to return a node?

This is done to get rid of the extra container_of() when using an extended
node descriptor, such as PCacheNode. Is that not acceptable?
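
That is, with the void* return type the caller can write directly (PCacheNode
is the extended descriptor from the pcache series; the member name "common"
is an assumption here):

    PCacheNode *node = rbcache_search(rbcache, offset, bytes);

whereas with an RBCacheNode* return type every such caller would need:

    RBCacheNode *rbnode = rbcache_search(rbcache, offset, bytes);
    PCacheNode *node = container_of(rbnode, PCacheNode, common);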

>> +/**
>> + * rbcache_insert:
>> + * @rbcache: the cache object.
>> + * @node: a new node for the cache.
>> + *
>> + * Returns the new node, or old node if the node already exists.
>> + */
>> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);
>
> What does "if the node already exists" mean? If @node (the very same
> object) is already stored in the cache object, or if a node describing
> the same range already exists?

if a node describing the same range already exists

>> +/**
>> + * rbcache_search_and_insert:
>> + * @rbcache: the cache object.
>> + * @offset: the start of the range.
>> + * @bytes: the size of the range.
>> + *
>> + * rbcache_search_and_insert() is like rbcache_insert(), except that a new node
>> + * is allocated inside the function. Returns the new node, or old node if the
>> + * node already exists.
>> + */
>> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes);
>
> What happens if a node exists, but only for part of the range?


In this case, it returns the existing node.
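
For example:

    /* The cache holds a node for [4096,8192).
     * rbcache_search_and_insert(rbcache, 0, 8192) overlaps that node, so
     * nothing new is inserted; the existing [4096,8192) node is returned
     * and the caller has to handle the uncovered part [0,4096) itself. */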

>> +/**
>> + * rbcache_remove:
>> + * @rbcache: the cache object.
>> + * @node: the node to remove.
>> + *
>> + * Removes the cached range owned by the node.
>> + */
>> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
>> +
>> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes);
>> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);
>
> No comment for these two?

will add.
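
For instance (the wording is only a suggestion):

    /**
     * rbcache_node_alloc:
     * @rbcache: the cache object.
     * @offset: the start of the range.
     * @bytes: the size of the range.
     *
     * Allocates a new node with the alloc callback passed to
     * rbcache_create(), or with g_malloc() if no callbacks were given, and
     * initialises offset and bytes. The node is not inserted into the cache.
     */

    /**
     * rbcache_node_free:
     * @rbcache: the cache object.
     * @node: the node to free.
     *
     * Releases the node with the free callback passed to rbcache_create(),
     * or with g_free() if no callbacks were given.
     */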

>> +
>> +/**
>> + * rbcache_create:
>> + * @alloc: callback for node allocation; allows embedding the node in a
>> + *         larger, user-defined structure. May be NULL.
>> + * @free: callback for node release; must be provided together with @alloc.
>> + * @limit_size: maximum cache size in bytes.
>> + * @eviction_type: the eviction policy (RBCACHE_FIFO or RBCACHE_LRU).
>> + * @opaque: the opaque pointer to pass to the callbacks.
>> + *
>> + * Returns the cache object.
>> + */
>> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
>> +                        uint64_t limit_size, int eviction_type, void *opaque);
>> +
>> +/**
>> + * rbcache_destroy:
>> + * @rbcache: the cache object.
>> + *
>> + * Cleanup the cache object created with rbcache_create().
>> + */
>> +void rbcache_destroy(RBCache *rbcache);
>> +
>> +#endif /* RBCACHE_H */
>> diff --git a/util/Makefile.objs b/util/Makefile.objs
>> index 727a567..9fb0de6 100644
>> --- a/util/Makefile.objs
>> +++ b/util/Makefile.objs
>> @@ -38,3 +38,4 @@ util-obj-y += qdist.o
>>   util-obj-y += qht.o
>>   util-obj-y += range.o
>>   util-obj-y += rbtree.o
>> +util-obj-y += rbcache.o
>> diff --git a/util/rbcache.c b/util/rbcache.c
>> new file mode 100644
>> index 0000000..05cfa5a
>> --- /dev/null
>> +++ b/util/rbcache.c
>> @@ -0,0 +1,246 @@
>> +/*
>> + * QEMU Range-Based Cache core
>> + *
>> + * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
>> + *
>> + * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/rbcache.h"
>> +
>> +/* RBCache provides functionality for caching data, primarily from block
>> + * devices. A range is used as the main key for searching and storing data.
>> + * The cache is based on red-black trees, so the basic operations (search,
>> + * insert, delete) are performed in O(log n).
>> + *
>> + * It is important to note that QEMU usually does not require a data cache,
>> + * but in practice there are already cases where caching small amounts of
>> + * data can increase performance. Red-black trees were chosen because they
>> + * are a fairly simple data structure that shows high efficiency on a small
>> + * number of elements. Therefore, with a minimum range of 512 bytes, the
>> + * recommended cache size is no more than 8-16 MB. Also note that this
>> + * cache implementation allows storing ranges of different lengths without
>> + * alignment.
>> + */
>> +
>> +struct RBCache {
>> +    struct RbRoot root;
>> +    RBNodeAlloc *alloc;
>> +    RBNodeFree  *free;
>> +    uint64_t limit_size;
>> +    uint64_t cur_size;
>> +    int eviction_type;
>
> s/int/enum eviction_type/
>
>> +    void *opaque;
>> +
>> +    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
>> +};
>> +
>> +static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
>> +{
>> +    assert(node1 != NULL);
>> +    assert(node2 != NULL);
>> +
>> +    if (node1->offset >= node2->offset + node2->bytes) {
>> +        return 1;
>> +    }
>> +    if (node1->offset + node1->bytes <= node2->offset) {
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static RBCacheNode *node_previous(RBCacheNode* node, uint64_t target_offset)
>
> It would be nice to describe this function in a comment.

Ok.

>  From what I understand, it goes backwards in the tree until the last
> node whose range end is larger than target_offset is reached.

Right.
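
A worked example:

    /* The tree holds the disjoint ranges [100,150) and [200,250).
     * A search for [100,300) may land on [200,250) first; node_previous()
     * then steps back while the previous node still ends after
     * target_offset=100, stopping at [100,150), the lowest node that
     * overlaps the searched range. */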

>> +{
>> +    while (node) {
>> +        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
>> +        RBCacheNode *prev_node;
>> +        if (prev_rb_node == NULL) {
>> +            break;
>> +        }
>> +        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
>> +        if (prev_node->offset + prev_node->bytes <= target_offset) {
>> +            break;
>> +        }
>> +        node = prev_node;
>> +    }
>> +
>> +    assert(node != NULL);
>> +
>> +    return node;
>> +}
>> +
>> +RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes)
>> +{
>> +    RBCacheNode *node;
>> +
>> +    if (rbcache->alloc) {
>> +        node = rbcache->alloc(offset, bytes, rbcache->opaque);
>> +    } else {
>> +        node = g_malloc(sizeof(*node));
>
> g_new() is preferred.
>
>> +    }
>> +
>> +    node->offset = offset;
>> +    node->bytes  = bytes;
>
> Okay, so the callback doesn't have to set offset/bytes because it's
> already set here, and it can be NULL. Both things to be mentioned in the
> documentation.
>
>> +    return node;
>> +}
>> +
>> +void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    if (rbcache->free) {
>> +        rbcache->free(node, rbcache->opaque);
>> +    } else {
>> +        g_free(node);
>> +    }
>> +}
>> +
>> +static void rbcache_try_shrink(RBCache *rbcache)
>> +{
>> +    while (rbcache->cur_size > rbcache->limit_size) {
>> +        RBCacheNode *node;
>> +        assert(!QTAILQ_EMPTY(&rbcache->queue));
>> +
>> +        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
>> +
>> +        rbcache_remove(rbcache, node);
>> +    }
>> +}
>> +
>> +static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    if (rbcache->eviction_type == RBCACHE_LRU) {
>> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
>> +    }
>> +}
>> +
>> +static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)
>
> Here as well there seem to be a few different cases and it would be good
> to have a comment that specifies the behaviour for each of them. It
> would also give me something to review against.

Ok, I'll try to add more comments. It seems I have got used to the code, so
the issue has become less noticeable to me.
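
Something along these lines, summarising the behaviour as I currently read
the code:

    /*
     * node_insert() cases:
     * - No existing node overlaps [node->offset, node->offset + node->bytes):
     *   the node is linked into the tree (a fresh one is allocated first if
     *   @alloc is true), cur_size grows, and old nodes may be evicted to
     *   stay within limit_size; the inserted node is returned.
     * - At least one existing node overlaps the range: nothing is inserted;
     *   the overlapping node with the lowest offset is returned and, for
     *   RBCACHE_LRU, moved to the head of the eviction queue.
     */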

>> +{
>> +    struct RbNode **new, *parent = NULL;
>> +
>> +    assert(rbcache != NULL);
>> +    assert(node->bytes != 0);
>> +
>> +    /* Figure out where to put new node */
>> +    new = &(rbcache->root.rb_node);
>> +    while (*new) {
>> +        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
>> +        int result = node_key_cmp(node, this);
>> +        if (result == 0) {
>> +            this = node_previous(this, node->offset);
>
> So in case of partial overlaps, the existing overlapping node with the
> lowest offset is returned, but the range is not adjusted. Correct?

That's correct. RBCache cannot store overlapping nodes. But if we want to
fill in the missing ranges, we can do that based on the returned nodes (see
pickup_parts_of_cache()).
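
Roughly, such gap filling can look like this, given rbcache and the requested
range [offset, offset + bytes) (only a sketch; the real
pickup_parts_of_cache() may differ, and interaction with eviction is ignored):

    uint64_t off = offset;
    uint64_t end = offset + bytes;

    while (off < end) {
        RBCacheNode *node = rbcache_search_and_insert(rbcache, off,
                                                      end - off);
        if (node->offset > off) {
            /* The returned node is the lowest overlapping one, so the gap
             * [off, node->offset) is free and this insert cannot collide. */
            rbcache_search_and_insert(rbcache, off, node->offset - off);
        }
        off = node->offset + node->bytes;
    }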

>> +            node_move_in_queue(rbcache, this);
>> +            return this;
>> +        }
>> +        parent = *new;
>> +        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
>> +    }
>> +
>> +    if (alloc) {
>> +        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
>> +    }
>> +    /* Add new node and rebalance tree. */
>> +    rb_link_node(&node->rb_node, parent, new);
>> +    rb_insert_color(&node->rb_node, &rbcache->root);
>> +
>> +    rbcache->cur_size += node->bytes;
>> +
>> +    rbcache_try_shrink(rbcache);
>> +
>> +    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
>> +
>> +    return node;
>> +}
>> +
>> +void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
>> +{
>> +    struct RbNode *rb_node;
>> +    RBCacheNode node = {
>> +        .offset = offset,
>> +        .bytes  = bytes,
>> +    };
>> +
>> +    assert(rbcache != NULL);
>> +
>> +    rb_node = rbcache->root.rb_node;
>> +    while (rb_node) {
>> +        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
>> +        int32_t result = node_key_cmp(&node, this);
>> +        if (result == 0) {
>> +            this = node_previous(this, offset);
>> +            node_move_in_queue(rbcache, this);
>> +            return this;
>> +        }
>> +        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
>> +    }
>> +    return NULL;
>> +}
>> +
>> +void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    return node_insert(rbcache, node, false);
>> +}
>> +
>> +void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
>> +                                uint64_t bytes)
>> +{
>> +    RBCacheNode node = {
>> +        .offset = offset,
>> +        .bytes  = bytes,
>> +    };
>> +
>> +    return node_insert(rbcache, &node, true);
>> +}
>> +
>> +void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
>> +{
>> +    assert(rbcache->cur_size >= node->bytes);
>> +
>> +    rbcache->cur_size -= node->bytes;
>> +    rb_erase(&node->rb_node, &rbcache->root);
>> +
>> +    QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +
>> +    rbcache_node_free(rbcache, node);
>> +}
>> +
>> +RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
>> +                        uint64_t limit_size, int eviction_type, void *opaque)
>> +{
>> +    RBCache *rbcache = g_malloc(sizeof(*rbcache));
>
> Again, g_new() is preferred.
>
>> +
>> +    /* Either both callbacks must be provided, or neither */
>> +    assert(!(!alloc ^ !free));
>> +
>> +    rbcache->root   = RB_ROOT;
>> +    rbcache->alloc  = alloc;
>> +    rbcache->free   = free;
>> +    rbcache->opaque = opaque;
>> +    rbcache->limit_size = limit_size;
>> +    rbcache->cur_size   = 0;
>> +    rbcache->eviction_type = eviction_type;
>> +
>> +    QTAILQ_INIT(&rbcache->queue);
>
> And actually, it would be good to make sure that all fields are zeroed
> even if the struct is extended. You can do that with g_new0(), or -
> that's what I would do - write the initialisation like this:
>
>      *rbcache = (RBCache) {
>          .root       = RB_ROOT,
>          .alloc      = alloc,
>          ...
>      };

Yes, I like the second variant.
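
Spelled out for the current fields, that would be (cur_size is then
implicitly zero-initialised):

    *rbcache = (RBCache) {
        .root          = RB_ROOT,
        .alloc         = alloc,
        .free          = free,
        .opaque        = opaque,
        .limit_size    = limit_size,
        .eviction_type = eviction_type,
    };

    QTAILQ_INIT(&rbcache->queue);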

> If you align all values on a column (which is appreciated), please do so
> consistently and align everything to the same column as the longest
> field name (eviction_type).

Thanks, I will consider it.

>> +
>> +    return rbcache;
>> +}
>> +
>> +void rbcache_destroy(RBCache *rbcache)
>> +{
>> +    RBCacheNode *node, *next;
>> +
>> +    assert(rbcache != NULL);
>> +
>> +    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
>> +        QTAILQ_REMOVE(&rbcache->queue, node, entry);
>> +        rbcache_node_free(rbcache, node);
>> +    }
>> +
>> +    g_free(rbcache);
>> +}
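
For reference, the API in its current form is used roughly like this (the
1 MB limit and the LRU policy are arbitrary values for the example):

    /* Both callbacks omitted, so plain g_malloc()/g_free() are used. */
    RBCache *cache = rbcache_create(NULL, NULL, 1 << 20, RBCACHE_LRU, NULL);

    RBCacheNode *node = rbcache_search_and_insert(cache, 0, 4096);
    assert(rbcache_search(cache, 0, 4096) == node);

    rbcache_remove(cache, node);    /* erases from the tree and frees it */
    rbcache_destroy(cache);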

What do you think about this overall? Is it suitable for your data cache in
qcow2?

> Kevin
>

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index ddf797b..cb74802 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1365,6 +1365,12 @@  F: include/qemu/rbtree.h
 F: include/qemu/rbtree_augmented.h
 F: util/rbtree.c
 
+Range-Based Cache
+M: Denis V. Lunev <den@openvz.org>
+S: Supported
+F: include/qemu/rbcache.h
+F: util/rbcache.c
+
 UUID
 M: Fam Zheng <famz@redhat.com>
 S: Supported
diff --git a/include/qemu/rbcache.h b/include/qemu/rbcache.h
new file mode 100644
index 0000000..c8f0a9f
--- /dev/null
+++ b/include/qemu/rbcache.h
@@ -0,0 +1,105 @@ 
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef RBCACHE_H
+#define RBCACHE_H
+
+#include "qemu/rbtree.h"
+#include "qemu/queue.h"
+
+typedef struct RBCacheNode {
+    struct RbNode rb_node;
+    uint64_t offset;
+    uint64_t bytes;
+    QTAILQ_ENTRY(RBCacheNode) entry;
+} RBCacheNode;
+
+typedef struct RBCache RBCache;
+
+typedef RBCacheNode *RBNodeAlloc(uint64_t offset, uint64_t bytes, void *opaque);
+typedef void RBNodeFree(RBCacheNode *node, void *opaque);
+
+enum eviction_type {
+    RBCACHE_FIFO,
+    RBCACHE_LRU,
+};
+
+/**
+ * rbcache_search:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * Returns the node corresponding to the range (offset, bytes),
+ * or NULL if no node was found.
+ */
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes);
+
+/**
+ * rbcache_insert:
+ * @rbcache: the cache object.
+ * @node: a new node for the cache.
+ *
+ * Returns the new node, or old node if the node already exists.
+ */
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_search_and_insert:
+ * @rbcache: the cache object.
+ * @offset: the start of the range.
+ * @bytes: the size of the range.
+ *
+ * rbcache_search_and_insert() is like rbcache_insert(), except that a new node
+ * is allocated inside the function. Returns the new node, or old node if the
+ * node already exists.
+ */
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+/**
+ * rbcache_remove:
+ * @rbcache: the cache object.
+ * @node: the node to remove.
+ *
+ * Removes the cached range owned by the node.
+ */
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node);
+
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes);
+
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node);
+
+/**
+ * rbcache_create:
+ * @alloc: callback for node allocation; allows embedding the node in a
+ *         larger, user-defined structure. May be NULL.
+ * @free: callback for node release; must be provided together with @alloc.
+ * @limit_size: maximum cache size in bytes.
+ * @eviction_type: the eviction policy (RBCACHE_FIFO or RBCACHE_LRU).
+ * @opaque: the opaque pointer to pass to the callbacks.
+ *
+ * Returns the cache object.
+ */
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque);
+
+/**
+ * rbcache_destroy:
+ * @rbcache: the cache object.
+ *
+ * Cleanup the cache object created with rbcache_create().
+ */
+void rbcache_destroy(RBCache *rbcache);
+
+#endif /* RBCACHE_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 727a567..9fb0de6 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -38,3 +38,4 @@  util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
 util-obj-y += rbtree.o
+util-obj-y += rbcache.o
diff --git a/util/rbcache.c b/util/rbcache.c
new file mode 100644
index 0000000..05cfa5a
--- /dev/null
+++ b/util/rbcache.c
@@ -0,0 +1,246 @@ 
+/*
+ * QEMU Range-Based Cache core
+ *
+ * Copyright (C) 2015-2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Author: Pavel Butsykin <pbutsykin@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rbcache.h"
+
+/* RBCache provides functionality for caching data, primarily from block
+ * devices. A range is used as the main key for searching and storing data.
+ * The cache is based on red-black trees, so the basic operations (search,
+ * insert, delete) are performed in O(log n).
+ *
+ * It is important to note that QEMU usually does not require a data cache,
+ * but in practice there are already cases where caching small amounts of
+ * data can increase performance. Red-black trees were chosen because they
+ * are a fairly simple data structure that shows high efficiency on a small
+ * number of elements. Therefore, with a minimum range of 512 bytes, the
+ * recommended cache size is no more than 8-16 MB. Also note that this
+ * cache implementation allows storing ranges of different lengths without
+ * alignment.
+ */
+
+struct RBCache {
+    struct RbRoot root;
+    RBNodeAlloc *alloc;
+    RBNodeFree  *free;
+    uint64_t limit_size;
+    uint64_t cur_size;
+    int eviction_type;
+    void *opaque;
+
+    QTAILQ_HEAD(RBCacheNodeHead, RBCacheNode) queue;
+};
+
+static int node_key_cmp(const RBCacheNode *node1, const RBCacheNode *node2)
+{
+    assert(node1 != NULL);
+    assert(node2 != NULL);
+
+    if (node1->offset >= node2->offset + node2->bytes) {
+        return 1;
+    }
+    if (node1->offset + node1->bytes <= node2->offset) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static RBCacheNode *node_previous(RBCacheNode* node, uint64_t target_offset)
+{
+    while (node) {
+        struct RbNode *prev_rb_node = rb_prev(&node->rb_node);
+        RBCacheNode *prev_node;
+        if (prev_rb_node == NULL) {
+            break;
+        }
+        prev_node = container_of(prev_rb_node, RBCacheNode, rb_node);
+        if (prev_node->offset + prev_node->bytes <= target_offset) {
+            break;
+        }
+        node = prev_node;
+    }
+
+    assert(node != NULL);
+
+    return node;
+}
+
+RBCacheNode *rbcache_node_alloc(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode *node;
+
+    if (rbcache->alloc) {
+        node = rbcache->alloc(offset, bytes, rbcache->opaque);
+    } else {
+        node = g_malloc(sizeof(*node));
+    }
+
+    node->offset = offset;
+    node->bytes  = bytes;
+
+    return node;
+}
+
+void rbcache_node_free(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->free) {
+        rbcache->free(node, rbcache->opaque);
+    } else {
+        g_free(node);
+    }
+}
+
+static void rbcache_try_shrink(RBCache *rbcache)
+{
+    while (rbcache->cur_size > rbcache->limit_size) {
+        RBCacheNode *node;
+        assert(!QTAILQ_EMPTY(&rbcache->queue));
+
+        node = QTAILQ_LAST(&rbcache->queue, RBCacheNodeHead);
+
+        rbcache_remove(rbcache, node);
+    }
+}
+
+static inline void node_move_in_queue(RBCache *rbcache, RBCacheNode *node)
+{
+    if (rbcache->eviction_type == RBCACHE_LRU) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+    }
+}
+
+static RBCacheNode *node_insert(RBCache *rbcache, RBCacheNode *node, bool alloc)
+{
+    struct RbNode **new, *parent = NULL;
+
+    assert(rbcache != NULL);
+    assert(node->bytes != 0);
+
+    /* Figure out where to put new node */
+    new = &(rbcache->root.rb_node);
+    while (*new) {
+        RBCacheNode *this = container_of(*new, RBCacheNode, rb_node);
+        int result = node_key_cmp(node, this);
+        if (result == 0) {
+            this = node_previous(this, node->offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        parent = *new;
+        new = result < 0 ? &((*new)->rb_left) : &((*new)->rb_right);
+    }
+
+    if (alloc) {
+        node = rbcache_node_alloc(rbcache, node->offset, node->bytes);
+    }
+    /* Add new node and rebalance tree. */
+    rb_link_node(&node->rb_node, parent, new);
+    rb_insert_color(&node->rb_node, &rbcache->root);
+
+    rbcache->cur_size += node->bytes;
+
+    rbcache_try_shrink(rbcache);
+
+    QTAILQ_INSERT_HEAD(&rbcache->queue, node, entry);
+
+    return node;
+}
+
+void *rbcache_search(RBCache *rbcache, uint64_t offset, uint64_t bytes)
+{
+    struct RbNode *rb_node;
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    assert(rbcache != NULL);
+
+    rb_node = rbcache->root.rb_node;
+    while (rb_node) {
+        RBCacheNode *this = container_of(rb_node, RBCacheNode, rb_node);
+        int32_t result = node_key_cmp(&node, this);
+        if (result == 0) {
+            this = node_previous(this, offset);
+            node_move_in_queue(rbcache, this);
+            return this;
+        }
+        rb_node = result < 0 ? rb_node->rb_left : rb_node->rb_right;
+    }
+    return NULL;
+}
+
+void *rbcache_insert(RBCache *rbcache, RBCacheNode *node)
+{
+    return node_insert(rbcache, node, false);
+}
+
+void *rbcache_search_and_insert(RBCache *rbcache, uint64_t offset,
+                                uint64_t bytes)
+{
+    RBCacheNode node = {
+        .offset = offset,
+        .bytes  = bytes,
+    };
+
+    return node_insert(rbcache, &node, true);
+}
+
+void rbcache_remove(RBCache *rbcache, RBCacheNode *node)
+{
+    assert(rbcache->cur_size >= node->bytes);
+
+    rbcache->cur_size -= node->bytes;
+    rb_erase(&node->rb_node, &rbcache->root);
+
+    QTAILQ_REMOVE(&rbcache->queue, node, entry);
+
+    rbcache_node_free(rbcache, node);
+}
+
+RBCache *rbcache_create(RBNodeAlloc *alloc, RBNodeFree *free,
+                        uint64_t limit_size, int eviction_type, void *opaque)
+{
+    RBCache *rbcache = g_malloc(sizeof(*rbcache));
+
+    /* Either both callbacks must be provided, or neither */
+    assert(!(!alloc ^ !free));
+
+    rbcache->root   = RB_ROOT;
+    rbcache->alloc  = alloc;
+    rbcache->free   = free;
+    rbcache->opaque = opaque;
+    rbcache->limit_size = limit_size;
+    rbcache->cur_size   = 0;
+    rbcache->eviction_type = eviction_type;
+
+    QTAILQ_INIT(&rbcache->queue);
+
+    return rbcache;
+}
+
+void rbcache_destroy(RBCache *rbcache)
+{
+    RBCacheNode *node, *next;
+
+    assert(rbcache != NULL);
+
+    QTAILQ_FOREACH_SAFE(node, &rbcache->queue, entry, next) {
+        QTAILQ_REMOVE(&rbcache->queue, node, entry);
+        rbcache_node_free(rbcache, node);
+    }
+
+    g_free(rbcache);
+}