| Message ID | 200905191100.14252.borntraeger@de.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
Christian Bornträger wrote:
>> To summarize, Anthony thinks it should use virtio, while I believe
>> virtio is useful for exporting guest memory, not for importing host memory.
>>
>
> I think the current virtio interface is not ideal for importing host memory,
> but we can change that. If you look at the dcssblk driver for s390, it allows
> a guest to map shared memory segments via a diagnose (hypercall). This driver
> uses PCI regions to map memory.
>
> My point is, that the method to map memory is completely irrelevant, we just
> need something like mmap/shmget between the guest and the host. We could
> define an interface in virtio, that can be used by any transport. In case of
> pci this could be a simple pci map operation.
>
> What do you think about something like: (CCed Rusty)
>

Exactly.
Avi Kivity wrote:
> Christian Bornträger wrote:
>>> To summarize, Anthony thinks it should use virtio, while I believe
>>> virtio is useful for exporting guest memory, not for importing host
>>> memory.
>>>
>>
>> I think the current virtio interface is not ideal for importing host
>> memory, but we can change that. If you look at the dcssblk driver for
>> s390, it allows a guest to map shared memory segments via a diagnose
>> (hypercall). This driver uses PCI regions to map memory.
>>
>> My point is, that the method to map memory is completely irrelevant,
>> we just need something like mmap/shmget between the guest and the
>> host. We could define an interface in virtio, that can be used by any
>> transport. In case of pci this could be a simple pci map operation.
>> What do you think about something like: (CCed Rusty)
>>
>
> Exactly.
>

Agreed.
Christian Bornträger wrote:
> Am Montag 18 Mai 2009 16:26:15 schrieb Avi Kivity:
>> Christian Borntraeger wrote:
>>> Sorry for the late question, but I missed your first version. Is there a
>>> way to change that code to use virtio instead of PCI? That would allow us
>>> to use this driver on s390 and maybe other virtio transports.
>>
>> Opinion differs. See the discussion in
>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/30119.
>>
>> To summarize, Anthony thinks it should use virtio, while I believe
>> virtio is useful for exporting guest memory, not for importing host memory.
>
> I think the current virtio interface is not ideal for importing host memory,
> but we can change that. If you look at the dcssblk driver for s390, it allows
> a guest to map shared memory segments via a diagnose (hypercall). This driver
> uses PCI regions to map memory.
>
> My point is that the method to map memory is completely irrelevant; we just
> need something like mmap/shmget between the guest and the host. We could
> define an interface in virtio that can be used by any transport. In case of
> pci this could be a simple pci map operation.
>
> What do you think about something like: (CCed Rusty)
> ---
>  include/linux/virtio.h |   26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> Index: linux-2.6/include/linux/virtio.h
> ===================================================================
> --- linux-2.6.orig/include/linux/virtio.h
> +++ linux-2.6/include/linux/virtio.h
> @@ -71,6 +71,31 @@ struct virtqueue_ops {
>  };
>
>  /**
> + * virtio_device_ops - operations for virtio devices
> + * @map_region: map host buffer at a given address
> + *    vdev: the struct virtio_device we're talking about.
> + *    addr: The address where the buffer should be mapped (hint only)
> + *    length: The length of the mapping
> + *    identifier: the token that identifies the host buffer
> + *    Returns the mapping address or an error pointer.
> + * @unmap_region: unmap host buffer from the address
> + *    vdev: the struct virtio_device we're talking about.
> + *    addr: The address where the buffer is mapped
> + *    Returns 0 on success or an error
> + *
> + * TBD, we might need query etc.
> + */
> +struct virtio_device_ops {
> +	void * (*map_region)(struct virtio_device *vdev,
> +			     void *addr,
> +			     size_t length,
> +			     int identifier);
> +	int (*unmap_region)(struct virtio_device *vdev, void *addr);
> +/* we might need query region and other stuff */
> +};

Perhaps something that maps closer to the current add_buf/get_buf API.
Something like:

struct iovec *(*map_buf)(struct virtqueue *vq, unsigned int *out_num,
                         unsigned int *in_num);
void (*unmap_buf)(struct virtqueue *vq, struct iovec *iov, unsigned int
                  out_num, unsigned int in_num);

There's symmetry here, which is good. The one bad thing about it is that it
forces certain memory to be read-only and other memory to be read-write. I
don't see that as a bad thing though.

I think we'll need an interface like this to support driver domains too,
since the "backend" would itself live in a guest. To put it another way, in
QEMU, map_buf == virtqueue_pop and unmap_buf == virtqueue_push.

Regards,

Anthony Liguori
On Wed, 20 May 2009 02:21:08 am Cam Macdonell wrote:
> Avi Kivity wrote:
> > Christian Bornträger wrote:
> >>> To summarize, Anthony thinks it should use virtio, while I believe
> >>> virtio is useful for exporting guest memory, not for importing host
> >>> memory.

Yes, precisely.

But what's it *for*, this shared memory? Implementing shared memory is
trivial. Using it is harder. For example, inter-guest networking: you'd
have to copy packets in and out, making it slow as well as losing
abstraction.

The only interesting idea I can think of is exposing it to userspace, and
having that run some protocol across it for fast app <-> app comms. But if
that's your plan, you still have a lot of code to write!

So I guess I'm missing the big picture here?

Thanks,
Rusty.
Am Mittwoch 20 Mai 2009 04:58:38 schrieb Rusty Russell:
> On Wed, 20 May 2009 02:21:08 am Cam Macdonell wrote:
> > Avi Kivity wrote:
> > > Christian Bornträger wrote:
> > >>> To summarize, Anthony thinks it should use virtio, while I believe
> > >>> virtio is useful for exporting guest memory, not for importing host
> > >>> memory.
>
> Yes, precisely.
>
> But what's it *for*, this shared memory? Implementing shared memory is
> trivial. Using it is harder. For example, inter-guest networking: you'd
> have to copy packets in and out, making it slow as well as losing
> abstraction.
>
> The only interesting idea I can think of is exposing it to userspace, and
> having that run some protocol across it for fast app <-> app comms. But if
> that's your plan, you still have a lot of code to write!
>
> So I guess I'm missing the big picture here?

I can give some insights about shared memory usage in z/VM. z/VM uses
so-called discontiguous saved segments (DCSS) to share memory between guests.

(naming side note:
o discontiguous, because these segments can have holes and different access
  rights, e.g. you can build a DCSS that goes from 800M-801M read-only and
  900M-910M exclusive-write.
o segments, because the 2nd level of our page tables is called segment table.
)

z/VM uses these segments for several purposes:

o The monitoring subsystem uses a DCSS to get data from several components.

o Shared guest kernels: The CMS operating system is built as a bootable DCSS
  (called named-saved-segments, NSS). All guests have the same host pages for
  the read-only parts of the CMS kernel. The local data is stored in
  exclusive-write parts of the same NSS. Linux on System z is also capable of
  using this feature (CONFIG_SHARED_KERNEL). The kernel linkage is changed in
  a way that separates the read-only text segment from the other parts with
  segment-size alignment.

o Execute-in-place: This is a Linux feature to exploit the DCSS technology.
  The goal is to share identical guest pages without the additional overhead
  of KSM etc. We have a block device driver for DCSS. This block device driver
  supports the direct_access function and therefore allows using the xip
  option of ext2. The idea is to put binaries into a read-only ext2
  filesystem. Whenever an mmap is made on this file system, the page is not
  mapped into the page cache; the ptes point into the DCSS memory instead.
  Since the DCSS is demand-paged by the host, no memory is wasted for unused
  parts of the binaries. In case of COW the page is copied as usual. It turned
  out that installations with many similar guests (let's say 400 guests) will
  profit in terms of memory saving and quicker application startup (not the
  first guest, of course). There is a downside: this requires a skilled
  administrator to set up.

We have also experimented with network, POSIX shared memory, and shared
caches via DCSS. Most of these ideas turned out to be not very useful or hard
to implement properly.
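To make the execute-in-place flow above concrete, here is a minimal userspace
sketch of the guest-side view. The /xip mount point and binary path are
hypothetical; the point is only that the resulting text mapping ends up backed
directly by DCSS memory rather than by private page-cache pages, while a write
would trigger the usual COW copy.

/* Illustrative sketch only: maps a binary from a (hypothetical) xip-mounted
 * ext2 filesystem backed by a DCSS.  From userspace this is an ordinary
 * mmap(); the difference described above is invisible here. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/xip/usr/bin/someapp";	/* hypothetical path */
	struct stat st;
	void *text;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	if (fstat(fd, &st) < 0) {
		perror("fstat");
		return EXIT_FAILURE;
	}

	/* Read-only, executable mapping: with xip these pages are shared
	 * across guests via the DCSS instead of living in the page cache. */
	text = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC, MAP_PRIVATE,
		    fd, 0);
	if (text == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	printf("mapped %lld bytes at %p\n", (long long)st.st_size, text);
	munmap(text, st.st_size);
	close(fd);
	return 0;
}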
Am Dienstag 19 Mai 2009 20:39:24 schrieb Anthony Liguori:
> Perhaps something that maps closer to the current add_buf/get_buf API.
> Something like:
>
> struct iovec *(*map_buf)(struct virtqueue *vq, unsigned int *out_num,
>                          unsigned int *in_num);
> void (*unmap_buf)(struct virtqueue *vq, struct iovec *iov, unsigned int
>                   out_num, unsigned int in_num);
>
> There's symmetry here, which is good. The one bad thing about it is that it
> forces certain memory to be read-only and other memory to be read-write.
> I don't see that as a bad thing though.
>
> I think we'll need an interface like this to support driver domains too,
> since the "backend" would itself live in a guest. To put it another way, in
> QEMU, map_buf == virtqueue_pop and unmap_buf == virtqueue_push.

You are proposing that the guest should define some guest memory to be used
as shared memory (some kind of replacement), right? This is fine, as long as
we can _also_ map host memory somewhere else (e.g. after guest memory, above
1TB etc.). I definitely want to be able to have a 64MB guest map a 2GB shared
memory zone. (See my other mail about the execute-in-place via DCSS use case.)

I think we should start to write down some requirements. This will help to
get a better understanding of the necessary interface. Here are my first
ideas:

o allow mapping host shared memory to any place that can be addressed via a PFN
o allow mapping beyond guest storage
o allow replacing guest memory
o read-only and read/write modes
o driver interface should not depend on hardware-specific stuff (e.g. prefer
  generic virtio over PCI)

More ideas are welcome.
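As an illustration of the "map beyond guest storage" and "2GB zone in a 64MB
guest" requirements, here is a minimal sketch written against the
map_region/unmap_region operations proposed earlier in the thread. The
SHMEM_* constants, the fixed hint address and the vdev->ops indirection are
all assumptions taken from that (unmerged) proposal, and the sketch assumes a
64-bit guest.

/* Hypothetical sketch only: uses the proposed virtio_device_ops to map a
 * 2GB host segment well above the guest's own RAM. */
#include <linux/err.h>
#include <linux/virtio.h>

#define SHMEM_ID_XIP	1			/* assumed host-side token */
#define SHMEM_HINT	((void *)(1UL << 40))	/* hint: 1TB, above guest memory */
#define SHMEM_LEN	(2UL << 30)		/* 2GB shared segment */

static void *shmem_base;

static int example_map_shared(struct virtio_device *vdev)
{
	shmem_base = vdev->ops->map_region(vdev, SHMEM_HINT, SHMEM_LEN,
					   SHMEM_ID_XIP);
	if (IS_ERR(shmem_base))
		return PTR_ERR(shmem_base);

	/* shmem_base now refers to host memory that is not part of the
	 * guest's RAM; a read-only mode would additionally be needed for
	 * the DCSS/XIP-style use case. */
	return 0;
}

static void example_unmap_shared(struct virtio_device *vdev)
{
	vdev->ops->unmap_region(vdev, shmem_base);
}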
On Wed, May 20, 2009 at 4:58 AM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> The only interesting idea I can think of is exposing it to userspace, and
> having that run some protocol across it for fast app <-> app comms. But if
> that's your plan, you still have a lot of code to write!
>
> So I guess I'm missing the big picture here?

Hello Rusty,

For an example, you may have a look at a paper I wrote last year on achieving
fast MPI-like message passing between guests over shared memory [1]. For my
proof-of-concept implementation, I introduced a virtual device that allows
performing DMA between guests, something for which virtio is well suited, but
also sharing some memory to transfer small messages efficiently from
userspace. To expose this shared memory to guests, I implemented something
quite similar to what Cam is proposing, which was to expose it as the memory
of a PCI device. I think it could be a useful addition to virtio if it
allowed abstracting this.

François

[1] http://hal.archives-ouvertes.fr/docs/00/36/86/22/PDF/vhpc08.pdf
Christian Bornträger wrote:
> o Shared guest kernels: The CMS operating system is built as a bootable DCSS
>   (called named-saved-segments, NSS). All guests have the same host pages for
>   the read-only parts of the CMS kernel. The local data is stored in
>   exclusive-write parts of the same NSS. Linux on System z is also capable of
>   using this feature (CONFIG_SHARED_KERNEL). The kernel linkage is changed in
>   a way that separates the read-only text segment from the other parts with
>   segment-size alignment.

How does patching (smp, kprobes/jprobes, markers/ftrace) work with this?

> o Execute-in-place: This is a Linux feature to exploit the DCSS technology.
>   The goal is to share identical guest pages without the additional overhead
>   of KSM etc. We have a block device driver for DCSS. This block device
>   driver supports the direct_access function and therefore allows using the
>   xip option of ext2. The idea is to put binaries into a read-only ext2
>   filesystem. Whenever an mmap is made on this file system, the page is not
>   mapped into the page cache; the ptes point into the DCSS memory instead.
>   Since the DCSS is demand-paged by the host, no memory is wasted for unused
>   parts of the binaries. In case of COW the page is copied as usual. It
>   turned out that installations with many similar guests (let's say 400
>   guests) will profit in terms of memory saving and quicker application
>   startup (not the first guest, of course). There is a downside: this
>   requires a skilled administrator to set up.

ksm might be easier to admin, at the cost of some cpu time.
Am Mittwoch 20 Mai 2009 10:45:50 schrieb Avi Kivity:
> Christian Bornträger wrote:
> > o Shared guest kernels: The CMS operating system is built as a bootable
> >   DCSS (called named-saved-segments, NSS). All guests have the same host
> >   pages for the read-only parts of the CMS kernel. The local data is
> >   stored in exclusive-write parts of the same NSS. Linux on System z is
> >   also capable of using this feature (CONFIG_SHARED_KERNEL). The kernel
> >   linkage is changed in a way that separates the read-only text segment
> >   from the other parts with segment-size alignment.
>
> How does patching (smp, kprobes/jprobes, markers/ftrace) work with this?

It does not. :-)
Because of that, and since most distro kernels are fully modular and kernel
updates are another problem, this feature is not used very often for Linux.
It is used heavily in CMS, though.
Actually, we could do COW in the host, but then it is really not worth the
effort.

> > o Execute-in-place: This is a Linux feature to exploit the DCSS
> >   technology. The goal is to share identical guest pages without the
> >   additional overhead of KSM etc. We have a block device driver for DCSS.
> >   This block device driver supports the direct_access function and
> >   therefore allows using the xip option of ext2. The idea is to put
> >   binaries into a read-only ext2 filesystem. Whenever an mmap is made on
> >   this file system, the page is not mapped into the page cache; the ptes
> >   point into the DCSS memory instead. Since the DCSS is demand-paged by
> >   the host, no memory is wasted for unused parts of the binaries. In case
> >   of COW the page is copied as usual. It turned out that installations
> >   with many similar guests (let's say 400 guests) will profit in terms of
> >   memory saving and quicker application startup (not the first guest, of
> >   course). There is a downside: this requires a skilled administrator to
> >   set up.
>
> ksm might be easier to admin, at the cost of some cpu time.

Yes, KSM is easier, and it even finds duplicate data pages.
On the other hand, it only provides memory savings. It does not speed up
application startup like execute-in-place (major page faults become minor
page faults for text pages if the page is already backed by the host).
I am not claiming that KSM is useless. Depending on the scenario you might
want the one or the other, or even both. For typical desktop use, KSM is very
likely the better approach.
Christian Bornträger wrote:
> Am Mittwoch 20 Mai 2009 10:45:50 schrieb Avi Kivity:
>> Christian Bornträger wrote:
>>> o Shared guest kernels: The CMS operating system is built as a bootable
>>> DCSS (called named-saved-segments, NSS). All guests have the same host
>>> pages for the read-only parts of the CMS kernel. The local data is stored
>>> in exclusive-write parts of the same NSS. Linux on System z is also
>>> capable of using this feature (CONFIG_SHARED_KERNEL). The kernel linkage
>>> is changed in a way that separates the read-only text segment from the
>>> other parts with segment-size alignment.
>>
>> How does patching (smp, kprobes/jprobes, markers/ftrace) work with this?
>
> It does not. :-)
> Because of that, and since most distro kernels are fully modular and kernel
> updates are another problem, this feature is not used very often for Linux.
> It is used heavily in CMS, though.
> Actually, we could do COW in the host, but then it is really not worth the
> effort.

ksm on low throttle would solve all of those problems.

> Yes, KSM is easier, and it even finds duplicate data pages.
> On the other hand, it only provides memory savings. It does not speed up
> application startup like execute-in-place (major page faults become minor
> page faults for text pages if the page is already backed by the host).
> I am not claiming that KSM is useless. Depending on the scenario you might
> want the one or the other, or even both. For typical desktop use, KSM is
> very likely the better approach.

If ksm shares pagecache, then doesn't it become effectively XIP?

We could also hook virtio dma to preemptively share pages somehow.
Am Mittwoch 20 Mai 2009 11:11:57 schrieb Avi Kivity:
> > Yes, KSM is easier, and it even finds duplicate data pages.
> > On the other hand, it only provides memory savings. It does not speed up
> > application startup like execute-in-place (major page faults become minor
> > page faults for text pages if the page is already backed by the host).
> > I am not claiming that KSM is useless. Depending on the scenario you might
> > want the one or the other, or even both. For typical desktop use, KSM is
> > very likely the better approach.
>
> If ksm shares pagecache, then doesn't it become effectively XIP?

Not exactly, only for long-running guests with a stable working set. If the
guest boots up, its page cache is basically empty, but the shared segment is
populated; it's the startup where xip wins. The same is true for guests with
quickly changing working sets.

> We could also hook virtio dma to preemptively share pages somehow.

Yes, that is something to think about. One idea that is used on z/VM by a lot
of customers is to have a shared read-only disk for /usr that is cached by
the host.
Christian Bornträger wrote:
> Am Dienstag 19 Mai 2009 20:39:24 schrieb Anthony Liguori:
>> Perhaps something that maps closer to the current add_buf/get_buf API.
>> Something like:
>>
>> struct iovec *(*map_buf)(struct virtqueue *vq, unsigned int *out_num,
>>                          unsigned int *in_num);
>> void (*unmap_buf)(struct virtqueue *vq, struct iovec *iov, unsigned int
>>                   out_num, unsigned int in_num);
>>
>> There's symmetry here, which is good. The one bad thing about it is that it
>> forces certain memory to be read-only and other memory to be read-write.
>> I don't see that as a bad thing though.
>>
>> I think we'll need an interface like this to support driver domains too,
>> since the "backend" would itself live in a guest. To put it another way, in
>> QEMU, map_buf == virtqueue_pop and unmap_buf == virtqueue_push.
>
> You are proposing that the guest should define some guest memory to be used
> as shared memory (some kind of replacement), right?

No. map_buf() returns a mapped region of memory. Where that memory comes
from is up to the transport. It can be the result of an ioremap of a PCI BAR.

The model of virtio frontends today is:

o add buffer of guest's memory
o let backend do something with it
o get back buffer of guest's memory

The backend model (as implemented by QEMU) is:

o get buffer of mapped front-end memory
o do something with memory
o give buffer back

For implementing persistent shared memory, you need a vring with enough
elements to hold all of the shared memory regions at one time. This becomes
more practical with indirect scatter/gather entries.

Of course, whether vring is used at all is a transport detail.

Regards,

Anthony Liguori
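For reference, a short sketch of the backend-style flow described above,
written against the proposed (not existing) map_buf/unmap_buf operations. The
assumption that they would sit behind the vq_ops indirection of that era, and
the process_request() handler, are illustrations only.

/* Hypothetical sketch: assumes map_buf/unmap_buf were added to the virtqueue
 * operations; process_request() stands in for whatever the device-specific
 * backend would do with the mapped iovecs. */
#include <linux/uio.h>
#include <linux/virtio.h>

/* hypothetical device-specific handler */
static void process_request(struct iovec *out, unsigned int out_num,
			    struct iovec *in, unsigned int in_num);

static void example_service_queue(struct virtqueue *vq)
{
	unsigned int out_num, in_num;
	struct iovec *iov;

	/* "get buffer of mapped front-end memory" */
	while ((iov = vq->vq_ops->map_buf(vq, &out_num, &in_num)) != NULL) {
		/* the first out_num iovecs are read-only (the request),
		 * the following in_num iovecs are writable (the response) */
		process_request(iov, out_num, iov + out_num, in_num);

		/* "give buffer back" */
		vq->vq_ops->unmap_buf(vq, iov, out_num, in_num);
	}
}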
On Wed, 20 May 2009 05:03:01 pm Christian Bornträger wrote:
> Am Mittwoch 20 Mai 2009 04:58:38 schrieb Rusty Russell:
> > But what's it *for*, this shared memory? ...
>
> z/VM uses these segments for several purposes:
> o The monitoring subsystem uses a DCSS to get data from several components.

In KVM this probably doesn't require inter-guest access; presumably
monitoring is done on the host.

> o Shared guest kernels: The CMS operating system is built as a bootable
>   DCSS (called named-saved-segments, NSS). All guests have the same host
>   pages for the read-only parts of the CMS kernel. The local data is stored
>   in exclusive-write parts of the same NSS. Linux on System z is also
>   capable of using this feature (CONFIG_SHARED_KERNEL). The kernel linkage
>   is changed in a way that separates the read-only text segment from the
>   other parts with segment-size alignment.

This is unlikely for x86 at least, and as you point out, not good for
distributions either.

> o Execute-in-place: This is a Linux feature to exploit the DCSS technology.
>   The goal is to share identical guest pages without the additional
>   overhead of KSM etc. We have a block device driver for DCSS. This block
>   device driver supports the direct_access function and therefore allows
>   using the xip option of ext2. The idea is to put binaries into a
>   read-only ext2 filesystem. Whenever an mmap is made on this file system,
>   the page is not mapped into the page cache; the ptes point into the DCSS
>   memory instead. Since the DCSS is demand-paged by the host, no memory is
>   wasted for unused parts of the binaries. In case of COW the page is
>   copied as usual. It turned out that installations with many similar
>   guests (let's say 400 guests) will profit in terms of memory saving and
>   quicker application startup (not the first guest, of course). There is a
>   downside: this requires a skilled administrator to set up.

We're better off doing opportunistic KSM in virtio_blk, I'd say. Anyway,
it's not really "inter-guest" in this sense; the host controls it, though it
lets multiple guests read from it.

> We have also experimented with network, POSIX shared memory, and shared
> caches via DCSS. Most of these ideas turned out to be not very useful or
> hard to implement properly.

Indeed, and this is what I suspect these patches are aiming for...

Thanks,
Rusty.
Index: linux-2.6/include/linux/virtio.h
===================================================================
--- linux-2.6.orig/include/linux/virtio.h
+++ linux-2.6/include/linux/virtio.h
@@ -71,6 +71,31 @@ struct virtqueue_ops {
 };
 
 /**
+ * virtio_device_ops - operations for virtio devices
+ * @map_region: map host buffer at a given address
+ *    vdev: the struct virtio_device we're talking about.
+ *    addr: The address where the buffer should be mapped (hint only)
+ *    length: The length of the mapping
+ *    identifier: the token that identifies the host buffer
+ *    Returns the mapping address or an error pointer.
+ * @unmap_region: unmap host buffer from the address
+ *    vdev: the struct virtio_device we're talking about.
+ *    addr: The address where the buffer is mapped
+ *    Returns 0 on success or an error
+ *
+ * TBD, we might need query etc.
+ */
+struct virtio_device_ops {
+	void * (*map_region)(struct virtio_device *vdev,
+			     void *addr,
+			     size_t length,
+			     int identifier);
+	int (*unmap_region)(struct virtio_device *vdev, void *addr);
+/* we might need query region and other stuff */
+};
+
+
+/**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
  * @dev: underlying device.
@@ -85,6 +110,7 @@ struct virtio_device
 	struct device dev;
 	struct virtio_device_id id;
 	struct virtio_config_ops *config;
+	struct virtio_device_ops *ops;
 	/* Note that this is a Linux set_bit-style bitmap. */
 	unsigned long features[1];
 	void *priv;
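A transport would then have to supply these operations. As one possibility,
and in line with Anthony's remark that the mapping "can be the result of an
ioremap of a PCI BAR", here is a hypothetical sketch of how a PCI transport
could back them. The BAR number, the to_vp_device()-style container_of helper,
and the decision to ignore the address hint are all assumptions; none of this
exists in virtio_pci.

/* Hypothetical sketch of a PCI transport backing the proposed operations:
 * the shared host buffer is assumed to be exposed through a dedicated BAR,
 * which the guest simply ioremaps. */
#include <linux/err.h>
#include <linux/io.h>
#include <linux/pci.h>
#include <linux/virtio.h>

#define VP_SHM_BAR 2	/* assumed BAR carrying the shared memory region */

static void *vp_map_region(struct virtio_device *vdev, void *addr,
			   size_t length, int identifier)
{
	/* assumes a container_of helper like virtio_pci's to_vp_device() */
	struct pci_dev *pci_dev = to_vp_device(vdev)->pci_dev;

	/* identifier would select among several host buffers; this
	 * single-BAR sketch ignores it. */
	if (length > pci_resource_len(pci_dev, VP_SHM_BAR))
		return ERR_PTR(-EINVAL);

	/* addr is only a hint; a PCI transport can ignore it and return
	 * whatever address ioremap hands back. */
	return (void *)ioremap(pci_resource_start(pci_dev, VP_SHM_BAR),
			       length);
}

static int vp_unmap_region(struct virtio_device *vdev, void *addr)
{
	iounmap(addr);
	return 0;
}

static struct virtio_device_ops vp_device_ops = {
	.map_region	= vp_map_region,
	.unmap_region	= vp_unmap_region,
};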