
[v2] kvm tools: Add QCOW level2 caching support

Message ID 1306956366-4634-1-git-send-email-prasadjoshi124@gmail.com (mailing list archive)
State New, archived

Commit Message

Prasad Joshi June 1, 2011, 7:26 p.m. UTC
QCOW uses two levels of tables: a level-1 (L1) table and level-2 (L2) tables.
Each L1 table entry points to the offset of an L2 table. When a QCOW image is
probed, the L1 table is cached in memory to avoid reading it from disk on
every access. This caching improves performance.

A similar performance improvement can be observed when L2 tables are cached.
Caching all of the L2 tables is impossible because of memory constraints. The
patch adds caching for up to 128 L2 tables, using a combination of an RB tree
and a linked list to manage the cached tables. The linked list provides a
simple LRU structure, while the RB tree allows a cached table to be looked up
efficiently.
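
As a rough illustration of that scheme (the complete implementation is in the
patch below): the RB tree is keyed by the L2 table's offset in the image file,
and a cache hit moves the entry to the tail of the LRU list, so the head is
always the least recently used table and is the one evicted once the 128-entry
limit is reached. With 64KB clusters, for instance, each L2 table holds 8192
entries (64KB), so the full cache costs roughly 8MB of memory while covering
up to 64GB of guest disk. The sketch below uses illustrative names and assumes
the kernel-style rbtree/list helpers shipped under tools/kvm.

#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/list.h>

struct l2_cache_entry {
	u64			offset;	/* file offset of the L2 table */
	u64			*table;	/* table contents, CPU byte-order */
	struct rb_node		node;	/* RB tree node, keyed by offset */
	struct list_head	lru;	/* LRU list node, oldest at head */
};

static u64 *l2_cache_lookup(struct rb_root *root, struct list_head *lru_list, u64 offset)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct l2_cache_entry *e = rb_entry(n, struct l2_cache_entry, node);

		if (offset < e->offset)
			n = n->rb_left;
		else if (offset > e->offset)
			n = n->rb_right;
		else {
			/* Hit: mark as most recently used */
			list_del(&e->lru);
			list_add_tail(&e->lru, lru_list);
			return e->table;
		}
	}

	/* Miss: the caller reads the table from disk and caches it,
	 * evicting the list head if the cache is already full. */
	return NULL;
}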

The performance numbers are below. The machine was started with the following
command-line arguments:

$ ./kvm run -d /home/prasad/VMDisks/Ubuntu10.10_64_cilk_qemu.qcow \
> --params "root=/dev/vda1" -m 409

Without QCOW caching

Comments

Ingo Molnar June 2, 2011, 7:28 a.m. UTC | #1
* Prasad Joshi <prasadjoshi124@gmail.com> wrote:

> Summary of performance numbers
> ==============================
> There is not much difference when sequential character operations are
> performed; the code with caching performed better by a small margin. With
> caching, sequential block output improved by 12% while sequential block input
> dropped by 0.5%. The caching code also suffered on random seeks, performing
> about 12% worse. The numbers improved drastically with sequential creates
> (62%) and delete operations (30%).

Looking at the numbers i think it's pretty clear that from this point 
on the quality of IO tests should be improved: Bonnie is too noisy 
and does not cut it anymore for finer enhancements.

To make measurements easier you could also do a simple trick: put 
*all* of the disk image into /dev/shm and add a command-line debug 
option that adds a fixed udelay(1000) call every time the code 
reads from the disk image.

This introduces a ~1msec delay and thus simulates IO, but the delays 
are *constant* [make sure you use a high-res timers kernel], so they 
do not result in nearly as much measurement noise as real block IO 
does.

The IO delays will still be there, so any caching advantages (and CPU 
overhead reductions) will be measurable very clearly.

This way you are basically 'emulating' a real disk drive but you will 
emulate uniform latencies, which makes measurements a lot more 
reliable - while still relevant to the end result.

So if under such a measurement model you can prove an improvement 
with a patch, that improvement will be there with real disks as well 
- just harder to prove.

Wanna try this? I really think you are hitting the limits of your 
current measurement methodology and you will be wasting time running 
more measurements instead of saving time doing more intelligent 
measurements ;-)

Thanks,

	Ingo
Prasad Joshi June 2, 2011, 8:36 a.m. UTC | #2
On Thu, Jun 2, 2011 at 8:28 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Prasad Joshi <prasadjoshi124@gmail.com> wrote:
>
>> Summary of performance numbers
>> ==============================
>> There is not much difference when sequential character operations are
>> performed; the code with caching performed better by a small margin. With
>> caching, sequential block output improved by 12% while sequential block input
>> dropped by 0.5%. The caching code also suffered on random seeks, performing
>> about 12% worse. The numbers improved drastically with sequential creates
>> (62%) and delete operations (30%).
>
> Looking at the numbers i think it's pretty clear that from this point
> on the quality of IO tests should be improved: Bonnie is too noisy
> and does not cut it anymore for finer enhancements.
>
> To make measurements easier you could also do a simple trick: put
> *all* of the disk image into /dev/shm and add a command-line debug
> option that adds a fixed udelay(1000) call every time the code
> reads from the disk image.
>
> This introduces a ~1msec delay and thus simulates IO, but the delays
> are *constant* [make sure you use a high-res timers kernel], so they
> do not result in nearly as much measurement noise as real block IO
> does.
>
> The IO delays will still be there, so any caching advantages (and CPU
> overhead reductions) will be measurable very clearly.
>
> This way you are basically 'emulating' a real disk drive but you will
> emulate uniform latencies, which makes measurements a lot more
> reliable - while still relevant to the end result.
>
> So if under such a measurement model you can prove an improvement
> with a patch, that improvement will be there with real disks as well
> - just harder to prove.
>
> Wanna try this? I really think you are hitting the limits of your
> current measurement methodology and you will be wasting time running
> more measurements instead of saving time doing more intelligent
> measurements ;-)
>

Thanks Ingo, will try this method.

I am not sure how to induce the delay you mentioned. I vaguely remember
you suggesting something similar a few days back. Let me see if I can
find the correct arguments to use for this delay; otherwise I will
post a query.

Thanks and Regards,
Prasad

> Thanks,
>
>        Ingo
>
Ingo Molnar June 2, 2011, 8:57 a.m. UTC | #3
* Prasad Joshi <prasadjoshi124@gmail.com> wrote:

> I am not sure how to induce the delay you mentioned. [...]

In the simplest version just add:

	if (debug__io_delay)
		udelay(1000);

to the code that does a read() from the disk image. This will 
introduce a 1 msec delay - that should be plenty enough to simulate 
most effects of IO delays.

Later on you could complicate it with some disk geometry 
approximation:

	delay = read_size_in_kb;	/* 1 usec per KB read */
	delay += seek_distance_in_kb;   /* 1 usec per KB disk-head movement */
	udelay(delay);

Where 'read_size_in_kb' is the number of bytes read, in KB, while 
seek_distance_in_kb measures the distance from the last byte read 
by the previous read() call to the first byte of the current read().
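
A userspace sketch of that model might look like the following; the option
name and helper are hypothetical (no such switch exists in the tool yet), and
since udelay() is a kernel primitive, usleep() stands in for it here. The
helper would be called next to each pread() from the image file:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

static bool debug__io_delay;		/* set from a hypothetical --debug-io-delay option */
static uint64_t last_end;		/* offset just past the previous simulated read */

static void simulate_io_delay(uint64_t offset, uint64_t len)
{
	uint64_t read_kb, seek_kb;

	if (!debug__io_delay)
		return;

	read_kb = len / 1024;				/* 1 usec per KB read */
	seek_kb = (offset > last_end ? offset - last_end
				     : last_end - offset) / 1024;	/* 1 usec per KB of head movement */

	usleep(1000 + read_kb + seek_kb);	/* fixed 1 msec base cost plus geometry terms */
	last_end = offset + len;
}

Dropping the two geometry terms gives exactly the simplest fixed 1 msec
version from the previous mail.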

( Also, instead of copying the disk image to /dev/shm you could add a 
  debug switch to mmap() and mlock() it directly. Assuming there's enough
  RAM in the box. )
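
A sketch of such a debug switch, assuming mmap()/mlock()-ing the whole image
is acceptable on the test box (the function name is illustrative only):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Map the image file and pin it in RAM so every access is a memory access;
 * only the artificial delay above is then left to model the disk. */
static void *lock_image_in_ram(int fd, size_t *size)
{
	struct stat st;
	void *p;

	if (fstat(fd, &st) < 0)
		return NULL;

	p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return NULL;

	if (mlock(p, st.st_size) < 0) {		/* needs enough RAM and RLIMIT_MEMLOCK */
		munmap(p, st.st_size);
		return NULL;
	}

	*size = st.st_size;
	return p;
}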

But i'd strongly suggest keeping the initial version simple.

Thanks,

	Ingo
Ingo Molnar June 2, 2011, 9:16 a.m. UTC | #4
* Ingo Molnar <mingo@elte.hu> wrote:

> This introduces a ~1msec delay and thus simulates IO, but the 
> delays are *constant* [make sure you use a high-res timers kernel], 
> so they do not result in nearly as much measurement noise as real 
> block IO does.
> 
> The IO delays will still be there, so any caching advantages (and 
> CPU overhead reductions) will be measurable very clearly.
> 
> This way you are basically 'emulating' a real disk drive but you 
> will emulate uniform latencies, which makes measurements a lot more 
> reliable - while still relevant to the end result.
> 
> So if under such a measurement model you can prove an improvement 
> with a patch, that improvement will be there with real disks as 
> well - just harder to prove.

Another risk the current situation carries, beyond making it more 
difficult to measure improvements, is that based on a "bad" Bonnie 
outlier or artifact you might accidentally throw away a perfectly 
good change!

So whenever you think you are fighting noise you need to improve your 
measurements, as entropy is a pretty tough opponent to beat.

Thanks,

	Ingo
Prasad Joshi June 2, 2011, 4:59 p.m. UTC | #5
On Thu, Jun 2, 2011 at 9:36 AM, Prasad Joshi <prasadjoshi124@gmail.com> wrote:
> On Thu, Jun 2, 2011 at 8:28 AM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> * Prasad Joshi <prasadjoshi124@gmail.com> wrote:
>>
>>> Summary of performance numbers
>>> ==============================
>>> There is not much difference when sequential character operations are
>>> performed; the code with caching performed better by a small margin. With
>>> caching, sequential block output improved by 12% while sequential block input
>>> dropped by 0.5%. The caching code also suffered on random seeks, performing
>>> about 12% worse. The numbers improved drastically with sequential creates
>>> (62%) and delete operations (30%).
>>
>> Looking at the numbers i think it's pretty clear that from this point
>> on the quality of IO tests should be improved: Bonnie is too noisy
>> and does not cut it anymore for finer enhancements.
>>
>> To make measurements easier you could also do a simple trick: put
>> *all* of the disk image into /dev/shm and add a command-line debug
>> option that adds a fixed udelay(1000) call every time the code
>> reads from the disk image.
>>
>> This introduces a ~1msec delay and thus simulates IO, but the delays
>> are *constant* [make sure you use a high-res timers kernel], so they
>> do not result in nearly as much measurement noise as real block IO
>> does.
>>
>> The IO delays will still be there, so any caching advantages (and CPU
>> overhead reductions) will be measurable very clearly.
>>
>> This way you are basically 'emulating' a real disk drive but you will
>> emulate uniform latencies, which makes measurements a lot more
>> reliable - while still relevant to the end result.
>>
>> So if under such a measurement model you can prove an improvement
>> with a patch, that improvement will be there with real disks as well
>> - just harder to prove.
>>
>> Wanna try this? I really think you are hitting the limits of your
>> current measurement methodology and you will be wasting time running
>> more measurements instead of saving time doing more intelligent
>> measurements ;-)
>>
>
> Thanks Ingo, will try this method.
>
> I am not sure how to induce the delay you mentioned. I could vaguely
> remember you suggesting something similar few days back. Let me see if
> I can find out the correct arguments to use this delay, otherwise will
> post a query.
>

I repeated the test after copying the image file to /dev/shm, as
suggested by Ingo.

BEFORE CACHING
=================
$ bonnie++ -d tmp/ -c 2 -s 1024
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   2     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
prasad-virtual-m 1G   996  99 433356  59 172755  42  5331 100 339838  49 15534 749
Latency             17704us   49654us   61299us    6152us    1838us     106ms
Version  1.96       ------Sequential Create------ --------Random Create--------
prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               475us     315us     358us     565us      56us      91us
1.96,1.96,prasad-virtual-machine,2,1307032968,1G,,996,99,433356,59,172755,42,5331,100,339838,49,15534,749,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,17704us,49654us,61299us,6152us,1838us,106ms,475us,315us,358us,565us,56us,91us


AFTER CACHING
===============
$ bonnie++ -d tmp/ -c 2 -s 1024
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   2     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
prasad-virtual-m 1G  1054  99 558323  74 226297  55  5436  99 504042  70 +++++ +++
Latency             12776us   23582us   39912us    6778us   20050us   30421us
Version  1.96       ------Sequential Create------ --------Random Create--------
prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               334us     391us     427us     472us      82us      79us
1.96,1.96,prasad-virtual-machine,2,1307032815,1G,,1054,99,558323,74,226297,55,5436,99,504042,70,+++++,+++,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,12776us,23582us,39912us,6778us,20050us,30421us,334us,391us,427us,472us,82us,79us

During both tests the machine was started with 512MB of memory, and the
test was performed without any additional IO delay.

The tests show that caching improves the performance. Thanks Ingo.

Regards,
Prasad


> Thanks and Regards,
> Prasad
>
>> Thanks,
>>
>>        Ingo
>>
>

Patch

Without QCOW caching
====================
$ sudo bonnie++ -d temp/ -c 10 -s 10000 -u 0
[sudo] password for prasad:
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency  10     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
prasad-virtu 10000M   923  91 49194  19 32315  12  3599  71 104608  15 126.1  22
Latency             21698us    7745ms    5615ms   66662us     185ms    1898ms
Version  1.96       ------Sequential Create------ --------Random Create--------
prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16610  17 +++++ +++ 19248  15 28493  26 +++++ +++ 32094  29
Latency               174ms    1184us     353us     641us     108us      63us
1.96,1.96,prasad-virtual-machine,10,1306952203,10000M,,923,91,49194,19,32315,
12,3599,71,104608,15,126.1,22,16,,,,,16610,17,+++++,+++,19248,15,28493,26,
+++++,+++,32094,29,21698us,7745ms,5615ms,66662us,185ms,1898ms,174ms,1184us,
353us,641us,108us,63us

With QCOW caching
=================
$ sudo bonnie++ -d temp/ -c 10 -s 10000 -u 0
[sudo] password for prasad:
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency  10     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
prasad-virtu 10000M   930  93 55257  18 33171  12  3731  71 104183  15 102.2  18
Latency             30910us   10347ms    5578ms     103ms     119ms    1209ms
Version  1.96       ------Sequential Create------ --------Random Create--------
prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26923  29 +++++ +++ 25039  19 +++++ +++ +++++ +++ +++++ +++
Latency             40450us    1227us     394us     513us     102us      56us
1.96,1.96,prasad-virtual-machine,10,1306953593,10000M,,930,93,55257,18,33171,
12,3731,71,104183,15,102.2,18,16,,,,,26923,29,+++++,+++,25039,19,+++++,+++,
+++++,+++,+++++,+++,30910us,10347ms,5578ms,103ms,119ms,1209ms,40450us,1227us,
394us,513us,102us,56us

Summary of performance numbers
==============================
There is not much difference when sequential character operations are
performed; the code with caching performed better by a small margin. With
caching, sequential block output improved by 12% while sequential block input
dropped by 0.5%. The caching code also suffered on random seeks, performing
about 12% worse. The numbers improved drastically with sequential creates (62%)
and delete operations (30%).

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
---
 tools/kvm/disk/qcow.c        |  220 ++++++++++++++++++++++++++++++++++++++---
 tools/kvm/include/kvm/qcow.h |   17 +++
 2 files changed, 220 insertions(+), 17 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index b52b734..7eb4cc7 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -16,6 +16,144 @@ 
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+static int insert(struct rb_root *root, struct qcow_l2_cache *new)
+{
+	struct rb_node **link = &(root->rb_node), *parent = NULL;
+	u64 offset = new->offset;
+	
+	/* search the tree */
+	while (*link) {
+		struct qcow_l2_cache *t;
+
+		t = rb_entry(*link, struct qcow_l2_cache, node);
+		if (!t)
+			goto error;
+
+		parent = *link;
+
+		if (t->offset > offset)
+			link = &(*link)->rb_left;
+		else if (t->offset < offset)
+			link = &(*link)->rb_right;
+		else
+			goto out;
+	}
+
+	/* add new node */
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, root);
+out:
+	return 0;
+error:
+	return -1;
+}
+
+static struct qcow_l2_cache *search(struct rb_root *root, u64 offset)
+{
+	struct rb_node *link = root->rb_node;
+
+	while (link) {
+		struct qcow_l2_cache *t;
+
+		t = rb_entry(link, struct qcow_l2_cache, node);
+		if (!t)
+			goto out;
+
+		if (t->offset > offset)
+			link = link->rb_left;
+		else if (t->offset < offset)
+			link = link->rb_right;
+		else
+			return t;
+	}
+out:
+	return NULL;
+}
+
+static void free_cache(struct qcow *q)
+{
+	struct list_head *pos, *n;
+	struct qcow_l2_cache *t;
+	struct rb_root *r = &q->root;
+
+	list_for_each_safe(pos, n, &q->lru_list) {
+		/* Remove cache table from the list and RB tree */
+		list_del(pos);
+		t = list_entry(pos, struct qcow_l2_cache, list);
+		rb_erase(&t->node, r);
+
+		/* Free the cached node */
+		free(t->table);
+		free(t);
+	}
+}
+
+static int cache_table(struct qcow *q, u64 *table, u64 offset)
+{
+	struct qcow_l2_cache *n;
+	struct rb_root *r = &q->root;
+	struct qcow_l2_cache *lru;
+
+	n = search(r, offset);
+	if (n) {
+		/* Update the LRU state */
+		list_del_init(&n->list);
+		list_add_tail(&n->list, &q->lru_list);
+		return 0;
+	}
+
+	n = calloc(1, sizeof(struct qcow_l2_cache));
+	if (!n)
+		goto error;
+
+	n->offset = offset;
+	n->table  = table;
+	n->qcow   = q;
+	RB_CLEAR_NODE(&n->node);
+	INIT_LIST_HEAD(&n->list);
+
+	if (q->no_cached == MAX_CACHE_NODES) {
+		/* Find the node for LRU replacement */
+		lru = list_first_entry(&q->lru_list, struct qcow_l2_cache, list);
+
+		/* Remove the node from the cache */
+		rb_erase(&lru->node, r);
+		list_del_init(&lru->list);
+		q->no_cached--;
+
+		/* Free the LRUed node */
+		free(lru->table);
+		free(lru);
+	}
+
+	/* Add new node in RB Tree: Helps in searching faster */
+	if (insert(r, n) < 0)
+		goto error;
+
+	/* Add in LRU replacement list */
+	list_add_tail(&n->list, &q->lru_list);
+	q->no_cached++;
+
+	return 0;
+error:
+	free(n);
+	return -1;
+}
+
+static int search_table(struct qcow *q, u64 **table, u64 offset)
+{
+	struct qcow_l2_cache *c;
+
+	*table = NULL;
+
+	c = search(&q->root, offset);
+	if (!c)
+		return -1;
+
+	*table = c->table;
+	return 0;
+}
+
 static inline u64 get_l1_index(struct qcow *q, u64 offset)
 {
 	struct qcow_header *header = q->header;
@@ -37,6 +175,43 @@  static inline u64 get_cluster_offset(struct qcow *q, u64 offset)
 	return offset & ((1 << header->cluster_bits)-1);
 }
 
+static int qcow_read_l2_table(struct qcow *q, u64 **table, u64 offset)
+{
+	struct qcow_header *header = q->header;
+	u64 size;
+	u64 i;
+	u64 *t;
+
+	*table = NULL;
+	size   = 1 << header->l2_bits;
+
+	/* search an entry for offset in cache */
+	if (search_table(q, table, offset) >= 0)
+		return 0;
+
+	t = calloc(size, sizeof(*t));
+	if (!t)
+		goto error;
+
+	/* table not cached: read from the disk */
+	if (pread_in_full(q->fd, t, size * sizeof(u64), offset) < 0)
+		goto error;
+
+	/* cache the table */
+	if (cache_table(q, t, offset) < 0)
+		goto error;
+
+	/* change cached table to CPU's byte-order */
+	for (i = 0; i < size; i++)
+		be64_to_cpus(&t[i]);
+
+	*table = t;
+	return 0;
+error:
+	free(t);
+	return -1;
+}
+
 static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_len)
 {
 	struct qcow_header *header = q->header;
@@ -51,7 +226,6 @@  static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_
 	u64 l1_idx;
 	u64 l2_idx;
 
-	l2_table = NULL;
 
 	cluster_size = 1 << header->cluster_bits;
 
@@ -72,18 +246,16 @@  static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_
 		goto zero_cluster;
 
 	l2_table_size = 1 << header->l2_bits;
-	l2_table = calloc(l2_table_size, sizeof(u64));
-	if (!l2_table)
-		goto out_error;
 
-	if (pread_in_full(q->fd, l2_table, l2_table_size * sizeof(u64), l2_table_offset) < 0)
+	/* read and cache level 2 table */
+	if (qcow_read_l2_table(q, &l2_table, l2_table_offset) < 0)
 		goto out_error;
 
 	l2_idx = get_l2_index(q, offset);
 	if (l2_idx >= l2_table_size)
 		goto out_error;
 
-	clust_start = be64_to_cpu(l2_table[l2_idx]) & ~header->oflag_mask;
+	clust_start = l2_table[l2_idx] & ~header->oflag_mask;
 	if (!clust_start)
 		goto zero_cluster;
 
@@ -91,7 +263,6 @@  static ssize_t qcow_read_cluster(struct qcow *q, u64 offset, void *dst, u32 dst_
 		goto out_error;
 
 out:
-	free(l2_table);
 	return length;
 
 zero_cluster:
@@ -221,15 +392,16 @@  static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src
 	if (len > src_len)
 		len = src_len;
 
-	l2t = calloc(l2t_sz, sizeof(u64));
-	if (!l2t)
-		goto error;
-
 	l2t_off		= table->l1_table[l1t_idx] & ~header->oflag_mask;
 	if (l2t_off) {
-		if (pread_in_full(q->fd, l2t, l2t_sz * sizeof(u64), l2t_off) < 0)
-			goto free_l2;
+		/* cache level2 table */
+		if (qcow_read_l2_table(q, &l2t, l2t_off) < 0)
+			goto error;
 	} else {
+		l2t = calloc(l2t_sz, sizeof(*l2t));
+		if (!l2t)
+			goto error;
+
 		/* Capture the state of the consistent QCOW image */
 		f_sz		= file_size(q->fd);
 		if (!f_sz)
@@ -251,6 +423,13 @@  static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src
 			goto free_l2;
 		}
 
+		if (cache_table(q, l2t, l2t_off) < 0) {
+			if (ftruncate(q->fd, f_sz) < 0)
+				goto free_l2;
+
+			goto free_l2;
+		}
+
 		/* Update the in-core entry */
 		table->l1_table[l1t_idx] = l2t_off;
 	}
@@ -258,17 +437,15 @@  static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src
 	/* Capture the state of the consistent QCOW image */
 	f_sz		= file_size(q->fd);
 	if (!f_sz)
-		goto free_l2;
+		goto error;
 
-	clust_start	= be64_to_cpu(l2t[l2t_idx]) & ~header->oflag_mask;
+	clust_start	= l2t[l2t_idx] & ~header->oflag_mask;
 	if (!clust_start) {
 		clust_start	= ALIGN(f_sz, clust_sz);
 		update_meta	= true;
 	} else
 		update_meta	= false;
 
-	free(l2t);
-
 	/* Write actual data */
 	if (pwrite_in_full(q->fd, buf, len, clust_start + clust_off) < 0)
 		goto error;
@@ -282,9 +459,13 @@  static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src
 
 			goto error;
 		}
+
+		/* Update the cached level2 entry */
+		l2t[l2t_idx] = clust_start;
 	}
 
 	return len;
+
 free_l2:
 	free(l2t);
 error:
@@ -336,6 +517,7 @@  static int qcow_disk_close(struct disk_image *disk)
 
 	q = disk->priv;
 
+	free_cache(q);
 	free(q->table.l1_table);
 	free(q->header);
 	free(q);
@@ -428,6 +610,8 @@  static struct disk_image *qcow2_probe(int fd, bool readonly)
 		goto error;
 
 	q->fd = fd;
+	q->root = RB_ROOT;
+	INIT_LIST_HEAD(&q->lru_list);
 
 	h = q->header = qcow2_read_header(fd);
 	if (!h)
@@ -525,6 +709,8 @@  static struct disk_image *qcow1_probe(int fd, bool readonly)
 		goto error;
 
 	q->fd = fd;
+	q->root = RB_ROOT;
+	INIT_LIST_HEAD(&q->lru_list);
 
 	h = q->header = qcow1_read_header(fd);
 	if (!h)
diff --git a/tools/kvm/include/kvm/qcow.h b/tools/kvm/include/kvm/qcow.h
index b6e7493..1663819 100644
--- a/tools/kvm/include/kvm/qcow.h
+++ b/tools/kvm/include/kvm/qcow.h
@@ -3,6 +3,8 @@ 
 
 #include <linux/types.h>
 #include <stdbool.h>
+#include <linux/rbtree.h>
+#include <linux/list.h>
 
 #define QCOW_MAGIC		(('Q' << 24) | ('F' << 16) | ('I' << 8) | 0xfb)
 
@@ -17,6 +19,16 @@ 
 #define QCOW2_OFLAG_COMPRESSED	(1LL << 62)
 #define QCOW2_OFLAG_MASK	(QCOW2_OFLAG_COPIED|QCOW2_OFLAG_COMPRESSED)
 
+#define MAX_CACHE_NODES         128
+
+struct qcow_l2_cache {
+	u64                     offset;
+	u64                     *table;
+	struct rb_node          node;
+	struct list_head        list;
+	struct qcow             *qcow;
+};
+
 struct qcow_table {
 	u32			table_size;
 	u64			*l1_table;
@@ -26,6 +38,11 @@  struct qcow {
 	void			*header;
 	struct qcow_table	table;
 	int			fd;
+
+	/* Level2 caching data structures */
+	struct rb_root          root;
+	struct list_head        lru_list;
+	int                     no_cached;
 };
 
 struct qcow_header {