Message ID | 1302351248-4853-1-git-send-email-levinsasha928@gmail.com (mailing list archive)
---|---
State | New, archived
On 09.04.2011, at 14:14, Sasha Levin wrote:
> Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.
That reminds me of an idea I had quite a while back.
What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
Have you thought about trying to implement such a feature?
Alex
Hi Alexander,

On 09.04.2011, at 14:14, Sasha Levin wrote:
>> Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.

On Sat, Apr 9, 2011 at 3:20 PM, Alexander Graf <agraf@suse.de> wrote:
> That reminds me of an idea I had quite a while back.
>
> What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
>
> Have you thought about trying to implement such a feature?

No, we haven't, but that sounds like something worth trying out!

			Pekka
On 04/09/2011 03:20 PM, Alexander Graf wrote:
> On 09.04.2011, at 14:14, Sasha Levin wrote:
>
> > Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.
>
> That reminds me of an idea I had quite a while back.
>
> What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
>
> Have you thought about trying to implement such a feature?

A creative idea, but I don't think it will work. On EPT hosts we don't have accessed/dirty bits, so you have to incur at least write faults to track dirty data and perhaps read faults to gather recency information. On non-EPT hosts you have to scan page tables to find out what you have to write out, and flush TLBs. Cache misses, of which you'd expect quite a few, would stall the vcpu (unless you use asynchronous page faults) and contribute less information to the host than virtio-blk (location of access but not size). Write misses are converted to read-modify-write operations.

Even in userspace I think mmap() is usually not a performance win. Its advantages are that it's simple to use and has low set-up time, which is useful for short-lived processes, especially with read-only backing files. For virtual machines it's much worse.
On Sun, Apr 10, 2011 at 11:15:29AM +0300, Avi Kivity wrote:
> On 04/09/2011 03:20 PM, Alexander Graf wrote:
> > On 09.04.2011, at 14:14, Sasha Levin wrote:
> >
> > > Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.
> >
> > That reminds me of an idea I had quite a while back.
> >
> > What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
> >
> > Have you thought about trying to implement such a feature?
>
> A creative idea, but I don't think it will work. On EPT hosts we
> don't have accessed/dirty bits, so you have to incur at least write
> faults to track dirty data and perhaps read faults to gather recency
> information. On non-EPT hosts you have to scan page tables to find out
> what you have to write out, and flush TLBs. Cache misses, of which
> you'd expect quite a few, would stall the vcpu (unless
> you use asynchronous page faults) and contribute less information to
> the host than virtio-blk (location of access but not size). Write
> misses are converted to read-modify-write operations.

The guest kernel can keep track of all modified sectors and pass them to
the hypervisor with a sync command.

> Even in userspace I think mmap() is usually not a performance win.
> Its advantages are that it's simple to use and has low set-up time,
> which is useful for short-lived processes, especially with read-only
> backing files. For virtual machines it's much worse.

--
			Gleb.
On 10.04.2011, at 10:15, Avi Kivity wrote:

> On 04/09/2011 03:20 PM, Alexander Graf wrote:
>> On 09.04.2011, at 14:14, Sasha Levin wrote:
>>
>> > Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.
>>
>> That reminds me of an idea I had quite a while back.
>>
>> What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
>>
>> Have you thought about trying to implement such a feature?
>
> A creative idea, but I don't think it will work. On EPT hosts we don't have accessed/dirty bits, so you have to incur at least write faults to track dirty data and perhaps read faults to gather recency information. On non-EPT hosts you have to scan page tables to find out what you have to write out, and flush TLBs. Cache misses, of which you'd expect quite a few, would stall the vcpu (unless you use asynchronous page faults) and contribute less information to the host than virtio-blk (location of access but not size). Write misses are converted to read-modify-write operations.

Since we're moving to 4k sector sizes, the RMW argument shouldn't matter too much a couple of years from now.

As for the faults, yes. We'd basically have to declare the file region as dirty logged, which means we get lots of page faults when accessing it. However, these are all lightweight exits, so we take one lightweight exit for each 4k chunk when doing writes. For reads, we probably really do need asynchronous page faults - everything else would stall the vcpus way too long.


Alex
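For reference, "declaring the file region as dirty logged" corresponds to KVM's dirty-logging interface. The sketch below is a generic KVM userspace snippet, not code from the kvm tool patch; the vm fd, slot number, addresses and sizes are placeholders:

/*
 * Generic sketch of KVM's dirty-logging interface, not code from this
 * patch: the vm fd, slot number, addresses and sizes are placeholders.
 */
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdlib.h>

static int enable_dirty_log(int vm_fd, uint64_t gpa, uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot			= 1,	/* placeholder slot number */
		.flags			= KVM_MEM_LOG_DIRTY_PAGES,
		.guest_phys_addr	= gpa,
		.memory_size		= size,
		.userspace_addr		= (uintptr_t)hva,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

/* Later the host fetches the per-page dirty bitmap for that slot. */
static void *fetch_dirty_bitmap(int vm_fd, uint64_t size)
{
	struct kvm_dirty_log log = { .slot = 1 };

	log.dirty_bitmap = calloc(1, (size / 4096 + 7) / 8);
	if (!log.dirty_bitmap)
		return NULL;

	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
		free(log.dirty_bitmap);
		return NULL;
	}

	return log.dirty_bitmap;
}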
On 04/10/2011 11:30 AM, Alexander Graf wrote:
> On 10.04.2011, at 10:15, Avi Kivity wrote:
>
> > On 04/09/2011 03:20 PM, Alexander Graf wrote:
> >> On 09.04.2011, at 14:14, Sasha Levin wrote:
> >>
> >> > Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.
> >>
> >> That reminds me of an idea I had quite a while back.
> >>
> >> What if we mmap'ed a raw disk image directly into the guest's address space? This could for example be done through a virtio feature addition, keeping the disk accessible through normal virtio plus the mmap'ed part. At least in writeback mode, this should perform pretty well, as we'd save all the userspace exits. It'd basically be almost like vhost-blk :).
> >>
> >> Have you thought about trying to implement such a feature?
> >
> > A creative idea, but I don't think it will work. On EPT hosts we don't have accessed/dirty bits, so you have to incur at least write faults to track dirty data and perhaps read faults to gather recency information. On non-EPT hosts you have to scan page tables to find out what you have to write out, and flush TLBs. Cache misses, of which you'd expect quite a few, would stall the vcpu (unless you use asynchronous page faults) and contribute less information to the host than virtio-blk (location of access but not size). Write misses are converted to read-modify-write operations.
>
> Since we're moving to 4k sector sizes, the RMW argument shouldn't matter too much a couple of years from now.

I wasn't talking about the sector size, rather that

    memcpy(&mmapped_disk[sector * SECTOR_SIZE], data, SECTOR_SIZE)

writes the data word by word, so on the first write you have to read in the entire page, then modify the first and following words.

> As for the faults, yes. We'd basically have to declare the file region as dirty logged, which means we get lots of page faults when accessing it. However, these are all lightweight exits, so we take one lightweight exit for each 4k chunk when doing writes. For reads, we probably really do need asynchronous page faults - everything else would stall the vcpus way too long.

These are still very expensive, compared to virtio-blk which can get you megabytes worth of data with a single exit.
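As an illustration of the two write paths being compared here, a minimal sketch with assumed names and sizes (not code from the patch):

/*
 * Illustration only (assumed names, not code from the patch): the mmap
 * path pays a fault per clean 4k page it touches, and a sub-page store
 * forces the host to read the whole page before modifying it, while the
 * read/write path can push megabytes with a single request.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE	512

static void write_via_mmap(uint8_t *mmapped_disk, uint64_t sector,
			   const void *data, size_t len)
{
	/* first touch of a clean page: fault, read the page in, then modify */
	memcpy(&mmapped_disk[sector * SECTOR_SIZE], data, len);
}

static ssize_t write_via_pwrite(int fd, uint64_t sector,
				const void *data, size_t len)
{
	/* one syscall (and, with virtio-blk, one exit) can cover a large batch */
	return pwrite(fd, data, len, sector * SECTOR_SIZE);
}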
On 04/10/2011 11:23 AM, Gleb Natapov wrote:
> > A creative idea, but I don't think it will work. On EPT hosts we
> > don't have accessed/dirty bits, so you have to incur at least write
> > faults to track dirty data and perhaps read faults to gather recency
> > information. On non-EPT hosts you have to scan page tables to find out
> > what you have to write out, and flush TLBs. Cache misses, of which
> > you'd expect quite a few, would stall the vcpu (unless
> > you use asynchronous page faults) and contribute less information to
> > the host than virtio-blk (location of access but not size). Write
> > misses are converted to read-modify-write operations.
>
> The guest kernel can keep track of all modified sectors and pass them to
> the hypervisor with a sync command.

We should probably do the same for reads, to take advantage of batching.

This can be easily implemented using the qemu ivshmem device, if someone wants to try it out.
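A rough sketch of the guest-side dirty-sector tracking suggested above; the bitmap size and the tell_host() callback are hypothetical, and this is not an existing virtio-blk feature:

/*
 * Hypothetical sketch of guest-side dirty-sector tracking as suggested
 * above; the bitmap size and the tell_host() callback are made up.
 */
#include <stdint.h>
#include <string.h>

#define NR_SECTORS	(1ULL << 20)		/* example: 512 MB disk, 512-byte sectors */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static unsigned long dirty_bitmap[NR_SECTORS / BITS_PER_LONG];

static void mark_sector_dirty(uint64_t sector)
{
	dirty_bitmap[sector / BITS_PER_LONG] |= 1UL << (sector % BITS_PER_LONG);
}

/* On a sync command, hand the dirty set to the hypervisor and clear it. */
static void sync_dirty_sectors(void (*tell_host)(uint64_t sector))
{
	uint64_t s;

	for (s = 0; s < NR_SECTORS; s++)
		if (dirty_bitmap[s / BITS_PER_LONG] & (1UL << (s % BITS_PER_LONG)))
			tell_host(s);

	memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
}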
How do you plan to handle I/O errors or ENOSPC conditions? Note that
shared writeable mappings are by far the feature in the VM/FS code
that is most error prone, including the impossibility of doing sensible
error handling.

The version that accidentally used MAP_PRIVATE actually makes a lot of
sense for an equivalent of qemu's snapshot mode, where the image is
read-only and changes are kept private as long as the amount of modified
blocks is small enough not to kill the host VM, but using shared
writeable mappings just seems dangerous.
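To illustrate the error-handling problem: with a MAP_SHARED mapping, a plain store cannot report -EIO or -ENOSPC, and the earliest point an application can even look for an error is msync(). A minimal sketch follows; the image name is a placeholder, and even msync() is not guaranteed to catch every asynchronous writeback failure:

/*
 * Sketch only: error reporting with a shared writeable mapping. A plain
 * store into the mapping has no way to return -EIO or -ENOSPC; the
 * earliest place to look for an error is msync(). The image name is a
 * placeholder.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("disk.img", O_RDWR);
	off_t size;
	char *p;

	if (fd < 0)
		return 1;

	size = lseek(fd, 0, SEEK_END);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 0;				/* no place for ENOSPC/EIO to surface here */

	if (msync(p, size, MS_SYNC) < 0)	/* first chance to see a write error */
		perror("msync");

	munmap(p, size);
	close(fd);
	return 0;
}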
On Mon, Apr 11, 2011 at 9:41 PM, Christoph Hellwig <hch@infradead.org> wrote:
> How do you plan to handle I/O errors or ENOSPC conditions? Note that
> shared writeable mappings are by far the feature in the VM/FS code
> that is most error prone, including the impossibility of doing sensible
> error handling.

Good point. I reverted the commit. Thanks!

On Mon, Apr 11, 2011 at 9:41 PM, Christoph Hellwig <hch@infradead.org> wrote:
> The version that accidentally used MAP_PRIVATE actually makes a lot of
> sense for an equivalent of qemu's snapshot mode, where the image is
> read-only and changes are kept private as long as the amount of modified
> blocks is small enough not to kill the host VM, but using shared
> writeable mappings just seems dangerous.

Yup. Sasha, mind submitting a MAP_PRIVATE version that's enabled with a
'--snapshot' (or equivalent) command line option?

			Pekka
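A minimal sketch of what the snapshot-mode mapping could look like: with MAP_PRIVATE, guest writes go to private copies of the pages and are thrown away on exit, so the underlying image is never modified. The helper name is hypothetical:

/*
 * Hypothetical sketch of the MAP_PRIVATE "snapshot" mapping suggested
 * above; the helper name is made up.
 */
#include <stdint.h>
#include <sys/mman.h>

static void *map_image_snapshot(int fd, uint64_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_NORESERVE, fd, 0);

	return (p == MAP_FAILED) ? NULL : p;
}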
diff --git a/tools/kvm/disk-image.c b/tools/kvm/disk-image.c
index 9deaf45..8c27392 100644
--- a/tools/kvm/disk-image.c
+++ b/tools/kvm/disk-image.c
@@ -13,21 +13,31 @@
 #include <unistd.h>
 #include <fcntl.h>
 
-struct disk_image *disk_image__new(int fd, uint64_t size, struct disk_image_operations *ops)
+static int raw_image__read_sector_mmap(struct disk_image *self, uint64_t sector, void *dst, uint32_t dst_len)
 {
-	struct disk_image *self;
+	uint64_t offset = sector << SECTOR_SHIFT;
 
-	self = malloc(sizeof *self);
-	if (!self)
-		return NULL;
+	if (offset + dst_len > self->size)
+		return -1;
 
-	self->fd = fd;
-	self->size = size;
-	self->ops = ops;
+	memcpy(dst, self->priv + offset, dst_len);
 
-	return self;
+	return 0;
+}
+
+static int raw_image__write_sector_mmap(struct disk_image *self, uint64_t sector, void *src, uint32_t src_len)
+{
+	uint64_t offset = sector << SECTOR_SHIFT;
+
+	if (offset + src_len > self->size)
+		return -1;
+
+	memcpy(self->priv + offset, src, src_len);
+
+	return 0;
 }
 
 static int raw_image__read_sector(struct disk_image *self, uint64_t sector, void *dst, uint32_t dst_len)
 {
 	uint64_t offset = sector << SECTOR_SHIFT;
@@ -59,6 +69,28 @@ static struct disk_image_operations raw_image_ops = {
 	.write_sector	= raw_image__write_sector,
 };
 
+static struct disk_image_operations raw_image_mmap_ops = {
+	.read_sector	= raw_image__read_sector_mmap,
+	.write_sector	= raw_image__write_sector_mmap,
+};
+
+struct disk_image *disk_image__new(int fd, uint64_t size)
+{
+	struct disk_image *self;
+
+	self = malloc(sizeof *self);
+	if (!self)
+		return NULL;
+
+	self->fd = fd;
+	self->size = size;
+
+	self->priv = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_NORESERVE, fd, 0);
+	self->ops = (self->priv == MAP_FAILED) ? &raw_image_ops : &raw_image_mmap_ops;
+
+	return self;
+}
+
 static struct disk_image *raw_image__probe(int fd)
 {
 	struct stat st;
@@ -66,7 +98,7 @@ static struct disk_image *raw_image__probe(int fd)
 	if (fstat(fd, &st) < 0)
 		return NULL;
 
-	return disk_image__new(fd, st.st_size, &raw_image_ops);
+	return disk_image__new(fd, st.st_size);
 }
 
 struct disk_image *disk_image__open(const char *filename)
@@ -97,6 +129,9 @@ void disk_image__close(struct disk_image *self)
 	if (self->ops->close)
 		self->ops->close(self);
 
+	if (self->priv != MAP_FAILED)
+		munmap(self->priv, self->size);
+
 	if (close(self->fd) < 0)
 		warning("close() failed");
 
diff --git a/tools/kvm/include/kvm/disk-image.h b/tools/kvm/include/kvm/disk-image.h
index df0a15d..8b78657 100644
--- a/tools/kvm/include/kvm/disk-image.h
+++ b/tools/kvm/include/kvm/disk-image.h
@@ -22,7 +22,7 @@ struct disk_image {
 };
 
 struct disk_image *disk_image__open(const char *filename);
-struct disk_image *disk_image__new(int fd, uint64_t size, struct disk_image_operations *ops);
+struct disk_image *disk_image__new(int fd, uint64_t size);
 void disk_image__close(struct disk_image *self);
 
 static inline int disk_image__read_sector(struct disk_image *self, uint64_t sector, void *dst, uint32_t dst_len)
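For reference, a short hypothetical caller of the changed interface: the mmap-vs-read/write decision is hidden behind the ops table picked by disk_image__new(), so callers go through disk_image__open() and disk_image__read_sector() unchanged. The image path below is the one used in the benchmark that follows:

/*
 * Hypothetical caller of the patched interface (not part of the patch):
 * the mmap vs. read/write choice is hidden behind self->ops.
 */
#include <stdint.h>

#include "kvm/disk-image.h"

int read_boot_sector(uint8_t *buf)
{
	struct disk_image *disk;

	disk = disk_image__open("./work/vms/gentoo.img");	/* image from the benchmark below */
	if (!disk)
		return -1;

	if (disk_image__read_sector(disk, 0, buf, 512) < 0) {
		disk_image__close(disk);
		return -1;
	}

	disk_image__close(disk);
	return 0;
}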
Attempt to use mmap first for working with a disk image; if the attempt fails (for example, a large image on a 32-bit system), fall back to using read/write.

Performance (kB/s) test using bonnie++ showed the following improvement:

kvm cmdline: ./kvm run --mem=256 --image=./work/vms/gentoo.img --kernel=/boot/bzImage-git
bonnie++ cmdline: bonnie++ -u 0

Before:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tux            480M   673 100 308017  61 288156  49  3286  99 892186  76 +++++ +++
Latency             12998us   50992us   35993us    3000us    1999us     201ms
Version  1.96       ------Sequential Create------ --------Random Create--------
tux                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency              3000us    1000us    1000us    1000us    1998us    1000us

After:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
tux            480M   672  99 415549  45 310350  52  3495  99 910360  83 +++++ +++
Latency             12998us    7998us   23996us    3998us    1000us     116ms
Version  1.96       ------Sequential Create------ --------Random Create--------
tux                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency              1000us    1000us    1000us    2000us    1000us    1999us

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 tools/kvm/disk-image.c             |   55 +++++++++++++++++++++++++++++------
 tools/kvm/include/kvm/disk-image.h |    2 +-
 2 files changed, 46 insertions(+), 11 deletions(-)