| Message ID | 1434472419-148742-4-git-send-email-imammedo@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Headers | show |
On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > since commit > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > it became possible to use a bigger amount of memory > slots, which is used by memory hotplug for > registering hotplugged memory. > However QEMU crashes if it's used with more than ~60 > pc-dimm devices and vhost-net since host kernel > in module vhost-net refuses to accept more than 65 > memory regions. > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 It was 64, not 65. > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> Still thinking about this: can you reorder this to be the last patch in the series please? Also - 509? I think if we are changing this, it'd be nice to create a way for userspace to discover the support and the # of regions supported. > --- > drivers/vhost/vhost.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 99931a0..6a18c92 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -30,7 +30,7 @@ > #include "vhost.h" > > enum { > - VHOST_MEMORY_MAX_NREGIONS = 64, > + VHOST_MEMORY_MAX_NREGIONS = 509, > VHOST_MEMORY_F_LOG = 0x1, > }; > > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
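For readers following the discovery mechanism Michael asks about above, a new vhost ioctl is the natural shape for it. The sketch below is purely illustrative: VHOST_GET_MEM_MAX_NREGIONS and its request number are hypothetical names made up for this note, not part of the vhost UAPI discussed in this thread; only VHOST_VIRTIO is the real ioctl magic from <linux/vhost.h>.

/* Hypothetical userspace probe: ask the kernel how many memory regions
 * VHOST_SET_MEM_TABLE will accept, falling back to the historical limit
 * when the (made-up) ioctl is absent.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

#ifndef VHOST_GET_MEM_MAX_NREGIONS
#define VHOST_GET_MEM_MAX_NREGIONS _IOR(VHOST_VIRTIO, 0x70, int) /* invented number */
#endif

static int query_max_mem_regions(int vhost_fd)
{
	int max = 64;	/* conservative fallback for kernels without the ioctl */

	if (ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS, &max) < 0)
		fprintf(stderr, "ioctl unsupported, assuming %d regions\n", max);
	return max;
}

With something like this, a management tool could size its memory map (or refuse a hotplug request) before hitting the kernel's hard limit instead of failing at VHOST_SET_MEM_TABLE time.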
On Tue, 16 Jun 2015 23:14:20 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > since commit > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > it became possible to use a bigger amount of memory > > slots, which is used by memory hotplug for > > registering hotplugged memory. > > However QEMU crashes if it's used with more than ~60 > > pc-dimm devices and vhost-net since host kernel > > in module vhost-net refuses to accept more than 65 > > memory regions. > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > It was 64, not 65. > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > Still thinking about this: can you reorder this to > be the last patch in the series please? sure > > Also - 509? userspace memory slots in terms of KVM, I made it match KVM's allotment of memory slots for userspace side. > I think if we are changing this, it'd be nice to > create a way for userspace to discover the support > and the # of regions supported. That was my first idea before extending KVM's memslots to teach kernel to tell qemu this number so that QEMU at least would be able to check if new memory slot could be added but I was redirected to a more simple solution of just extending vs everdoing things. Currently QEMU supports upto ~250 memslots so 509 is about twice high we need it so it should work for near future but eventually we might still teach kernel and QEMU to make things more robust. > > > > --- > > drivers/vhost/vhost.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > index 99931a0..6a18c92 100644 > > --- a/drivers/vhost/vhost.c > > +++ b/drivers/vhost/vhost.c > > @@ -30,7 +30,7 @@ > > #include "vhost.h" > > > > enum { > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > VHOST_MEMORY_F_LOG = 0x1, > > }; > > > > -- > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > On Tue, 16 Jun 2015 23:14:20 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > since commit > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > it became possible to use a bigger amount of memory > > > slots, which is used by memory hotplug for > > > registering hotplugged memory. > > > However QEMU crashes if it's used with more than ~60 > > > pc-dimm devices and vhost-net since host kernel > > > in module vhost-net refuses to accept more than 65 > > > memory regions. > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > It was 64, not 65. > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > Still thinking about this: can you reorder this to > > be the last patch in the series please? > sure > > > > > Also - 509? > userspace memory slots in terms of KVM, I made it match > KVM's allotment of memory slots for userspace side. Maybe KVM has its reasons for this #. I don't see why we need to match this exactly. > > I think if we are changing this, it'd be nice to > > create a way for userspace to discover the support > > and the # of regions supported. > That was my first idea before extending KVM's memslots > to teach kernel to tell qemu this number so that QEMU > at least would be able to check if new memory slot could > be added but I was redirected to a more simple solution > of just extending vs everdoing things. > Currently QEMU supports upto ~250 memslots so 509 > is about twice high we need it so it should work for near > future Yes but old kernels are still around. Would be nice if you can detect them. > but eventually we might still teach kernel and QEMU > to make things more robust. A new ioctl would be easy to add, I think it's a good idea generally. > > > > > > > --- > > > drivers/vhost/vhost.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > index 99931a0..6a18c92 100644 > > > --- a/drivers/vhost/vhost.c > > > +++ b/drivers/vhost/vhost.c > > > @@ -30,7 +30,7 @@ > > > #include "vhost.h" > > > > > > enum { > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > VHOST_MEMORY_F_LOG = 0x1, > > > }; > > > > > > -- > > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Jun 2015 08:34:26 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > On Tue, 16 Jun 2015 23:14:20 +0200 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > since commit > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > it became possible to use a bigger amount of memory > > > > slots, which is used by memory hotplug for > > > > registering hotplugged memory. > > > > However QEMU crashes if it's used with more than ~60 > > > > pc-dimm devices and vhost-net since host kernel > > > > in module vhost-net refuses to accept more than 65 > > > > memory regions. > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > It was 64, not 65. > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > Still thinking about this: can you reorder this to > > > be the last patch in the series please? > > sure > > > > > > > > Also - 509? > > userspace memory slots in terms of KVM, I made it match > > KVM's allotment of memory slots for userspace side. > > Maybe KVM has its reasons for this #. I don't see > why we need to match this exactly. np, I can cap it at safe 300 slots but it's unlikely that it would take cut off 1 extra hop since it's capped by QEMU at 256+[initial fragmented memory] > > > > I think if we are changing this, it'd be nice to > > > create a way for userspace to discover the support > > > and the # of regions supported. > > That was my first idea before extending KVM's memslots > > to teach kernel to tell qemu this number so that QEMU > > at least would be able to check if new memory slot could > > be added but I was redirected to a more simple solution > > of just extending vs everdoing things. > > Currently QEMU supports upto ~250 memslots so 509 > > is about twice high we need it so it should work for near > > future > > Yes but old kernels are still around. Would be nice if you > can detect them. > > > but eventually we might still teach kernel and QEMU > > to make things more robust. > > A new ioctl would be easy to add, I think it's a good > idea generally. I can try to do something like this on top of this series. > > > > > > > > > > > --- > > > > drivers/vhost/vhost.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > index 99931a0..6a18c92 100644 > > > > --- a/drivers/vhost/vhost.c > > > > +++ b/drivers/vhost/vhost.c > > > > @@ -30,7 +30,7 @@ > > > > #include "vhost.h" > > > > > > > > enum { > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > }; > > > > > > > > -- > > > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 08:34:26 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > since commit > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > slots, which is used by memory hotplug for > > > > > registering hotplugged memory. > > > > > However QEMU crashes if it's used with more than ~60 > > > > > pc-dimm devices and vhost-net since host kernel > > > > > in module vhost-net refuses to accept more than 65 > > > > > memory regions. > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > It was 64, not 65. > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > Still thinking about this: can you reorder this to > > > > be the last patch in the series please? > > > sure > > > > > > > > > > > Also - 509? > > > userspace memory slots in terms of KVM, I made it match > > > KVM's allotment of memory slots for userspace side. > > > > Maybe KVM has its reasons for this #. I don't see > > why we need to match this exactly. > np, I can cap it at safe 300 slots but it's unlikely that it > would take cut off 1 extra hop since it's capped by QEMU > at 256+[initial fragmented memory] But what's the point? We allocate 32 bytes per slot. 300*32 = 9600 which is more than 8K, so we are doing an order-3 allocation anyway. If we could cap it at 8K (256 slots) that would make sense since we could avoid wasting vmalloc space. I'm still not very happy with the whole approach, giving userspace ability allocate 4 whole pages of kernel memory like this. > > > > > > I think if we are changing this, it'd be nice to > > > > create a way for userspace to discover the support > > > > and the # of regions supported. > > > That was my first idea before extending KVM's memslots > > > to teach kernel to tell qemu this number so that QEMU > > > at least would be able to check if new memory slot could > > > be added but I was redirected to a more simple solution > > > of just extending vs everdoing things. > > > Currently QEMU supports upto ~250 memslots so 509 > > > is about twice high we need it so it should work for near > > > future > > > > Yes but old kernels are still around. Would be nice if you > > can detect them. > > > > > but eventually we might still teach kernel and QEMU > > > to make things more robust. > > > > A new ioctl would be easy to add, I think it's a good > > idea generally. > I can try to do something like this on top of this series. 
> > > > > > > > > > > > > > > > --- > > > > > drivers/vhost/vhost.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > index 99931a0..6a18c92 100644 > > > > > --- a/drivers/vhost/vhost.c > > > > > +++ b/drivers/vhost/vhost.c > > > > > @@ -30,7 +30,7 @@ > > > > > #include "vhost.h" > > > > > > > > > > enum { > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > }; > > > > > > > > > > -- > > > > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 17/06/2015 08:34, Michael S. Tsirkin wrote:
>>>> Also - 509?
>>> userspace memory slots in terms of KVM, I made it match
>>> KVM's allotment of memory slots for userspace side.
> Maybe KVM has its reasons for this #. I don't see
> why we need to match this exactly.

Nice power of two (512) minus the number of reserved slots required by
Intel's virtualization extensions (3: APIC access page, EPT identity
page table, VMX task state segment).

Paolo
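For context, the arithmetic Paolo describes maps onto the x86 KVM definitions of that era, which read roughly as follows (reproduced from memory, so treat the exact wording as approximate):

/* arch/x86/include/asm/kvm_host.h, approximately, after commit 1d4e7e3 */
#define KVM_USER_MEM_SLOTS	509
/* memory slots that are not exposed to userspace */
#define KVM_PRIVATE_MEM_SLOTS	3
#define KVM_MEM_SLOTS_NUM	(KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)	/* 512 */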
On Wed, 17 Jun 2015 09:39:06 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > > On Wed, 17 Jun 2015 08:34:26 +0200 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > > since commit > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > > slots, which is used by memory hotplug for > > > > > > registering hotplugged memory. > > > > > > However QEMU crashes if it's used with more than ~60 > > > > > > pc-dimm devices and vhost-net since host kernel > > > > > > in module vhost-net refuses to accept more than 65 > > > > > > memory regions. > > > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > > > It was 64, not 65. > > > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > > > Still thinking about this: can you reorder this to > > > > > be the last patch in the series please? > > > > sure > > > > > > > > > > > > > > Also - 509? > > > > userspace memory slots in terms of KVM, I made it match > > > > KVM's allotment of memory slots for userspace side. > > > > > > Maybe KVM has its reasons for this #. I don't see > > > why we need to match this exactly. > > np, I can cap it at safe 300 slots but it's unlikely that it > > would take cut off 1 extra hop since it's capped by QEMU > > at 256+[initial fragmented memory] > > But what's the point? We allocate 32 bytes per slot. > 300*32 = 9600 which is more than 8K, so we are doing > an order-3 allocation anyway. > If we could cap it at 8K (256 slots) that would make sense > since we could avoid wasting vmalloc space. 256 is amount of hotpluggable slots and there is no way to predict how initial memory would be fragmented (i.e. amount of slots it would take), if we guess wrong we are back to square one with crashing userspace. So I'd stay consistent with KVM's limit 509 since it's only limit, i.e. not actual amount of allocated slots. > I'm still not very happy with the whole approach, > giving userspace ability allocate 4 whole pages > of kernel memory like this. I'm working in parallel so that userspace won't take so many slots but it won't prevent its current versions crashing due to kernel limitation. > > > > > I think if we are changing this, it'd be nice to > > > > > create a way for userspace to discover the support > > > > > and the # of regions supported. > > > > That was my first idea before extending KVM's memslots > > > > to teach kernel to tell qemu this number so that QEMU > > > > at least would be able to check if new memory slot could > > > > be added but I was redirected to a more simple solution > > > > of just extending vs everdoing things. > > > > Currently QEMU supports upto ~250 memslots so 509 > > > > is about twice high we need it so it should work for near > > > > future > > > > > > Yes but old kernels are still around. Would be nice if you > > > can detect them. > > > > > > > but eventually we might still teach kernel and QEMU > > > > to make things more robust. > > > > > > A new ioctl would be easy to add, I think it's a good > > > idea generally. 
> > I can try to do something like this on top of this series. > > > > > > > > > > > > > > > > > > > > > --- > > > > > > drivers/vhost/vhost.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > index 99931a0..6a18c92 100644 > > > > > > --- a/drivers/vhost/vhost.c > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > @@ -30,7 +30,7 @@ > > > > > > #include "vhost.h" > > > > > > > > > > > > enum { > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > > }; > > > > > > > > > > > > -- > > > > > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 09:39:06 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > > > On Wed, 17 Jun 2015 08:34:26 +0200 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > > > since commit > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > > > slots, which is used by memory hotplug for > > > > > > > registering hotplugged memory. > > > > > > > However QEMU crashes if it's used with more than ~60 > > > > > > > pc-dimm devices and vhost-net since host kernel > > > > > > > in module vhost-net refuses to accept more than 65 > > > > > > > memory regions. > > > > > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > > > > > It was 64, not 65. > > > > > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > > > > > Still thinking about this: can you reorder this to > > > > > > be the last patch in the series please? > > > > > sure > > > > > > > > > > > > > > > > > Also - 509? > > > > > userspace memory slots in terms of KVM, I made it match > > > > > KVM's allotment of memory slots for userspace side. > > > > > > > > Maybe KVM has its reasons for this #. I don't see > > > > why we need to match this exactly. > > > np, I can cap it at safe 300 slots but it's unlikely that it > > > would take cut off 1 extra hop since it's capped by QEMU > > > at 256+[initial fragmented memory] > > > > But what's the point? We allocate 32 bytes per slot. > > 300*32 = 9600 which is more than 8K, so we are doing > > an order-3 allocation anyway. > > If we could cap it at 8K (256 slots) that would make sense > > since we could avoid wasting vmalloc space. > 256 is amount of hotpluggable slots and there is no way > to predict how initial memory would be fragmented > (i.e. amount of slots it would take), if we guess wrong > we are back to square one with crashing userspace. > So I'd stay consistent with KVM's limit 509 since > it's only limit, i.e. not actual amount of allocated slots. > > > I'm still not very happy with the whole approach, > > giving userspace ability allocate 4 whole pages > > of kernel memory like this. > I'm working in parallel so that userspace won't take so > many slots but it won't prevent its current versions > crashing due to kernel limitation. Right but at least it's not a regression. If we promise userspace to support a ton of regions, we can't take it back later, and I'm concerned about the memory usage. I think it's already safe to merge the binary lookup patches, and maybe cache and vmalloc, so that the remaining patch will be small. > > > > > > > I think if we are changing this, it'd be nice to > > > > > > create a way for userspace to discover the support > > > > > > and the # of regions supported. 
> > > > > That was my first idea before extending KVM's memslots > > > > > to teach kernel to tell qemu this number so that QEMU > > > > > at least would be able to check if new memory slot could > > > > > be added but I was redirected to a more simple solution > > > > > of just extending vs everdoing things. > > > > > Currently QEMU supports upto ~250 memslots so 509 > > > > > is about twice high we need it so it should work for near > > > > > future > > > > > > > > Yes but old kernels are still around. Would be nice if you > > > > can detect them. > > > > > > > > > but eventually we might still teach kernel and QEMU > > > > > to make things more robust. > > > > > > > > A new ioctl would be easy to add, I think it's a good > > > > idea generally. > > > I can try to do something like this on top of this series. > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > drivers/vhost/vhost.c | 2 +- > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > index 99931a0..6a18c92 100644 > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > @@ -30,7 +30,7 @@ > > > > > > > #include "vhost.h" > > > > > > > > > > > > > > enum { > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > > > }; > > > > > > > > > > > > > > -- > > > > > > > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Jun 2015 12:11:09 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote: > > On Wed, 17 Jun 2015 09:39:06 +0200 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > > > > On Wed, 17 Jun 2015 08:34:26 +0200 > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > > > > since commit > > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > > > > slots, which is used by memory hotplug for > > > > > > > > registering hotplugged memory. > > > > > > > > However QEMU crashes if it's used with more than ~60 > > > > > > > > pc-dimm devices and vhost-net since host kernel > > > > > > > > in module vhost-net refuses to accept more than 65 > > > > > > > > memory regions. > > > > > > > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > > > > > > > It was 64, not 65. > > > > > > > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > > > > > > > Still thinking about this: can you reorder this to > > > > > > > be the last patch in the series please? > > > > > > sure > > > > > > > > > > > > > > > > > > > > Also - 509? > > > > > > userspace memory slots in terms of KVM, I made it match > > > > > > KVM's allotment of memory slots for userspace side. > > > > > > > > > > Maybe KVM has its reasons for this #. I don't see > > > > > why we need to match this exactly. > > > > np, I can cap it at safe 300 slots but it's unlikely that it > > > > would take cut off 1 extra hop since it's capped by QEMU > > > > at 256+[initial fragmented memory] > > > > > > But what's the point? We allocate 32 bytes per slot. > > > 300*32 = 9600 which is more than 8K, so we are doing > > > an order-3 allocation anyway. > > > If we could cap it at 8K (256 slots) that would make sense > > > since we could avoid wasting vmalloc space. > > 256 is amount of hotpluggable slots and there is no way > > to predict how initial memory would be fragmented > > (i.e. amount of slots it would take), if we guess wrong > > we are back to square one with crashing userspace. > > So I'd stay consistent with KVM's limit 509 since > > it's only limit, i.e. not actual amount of allocated slots. > > > > > I'm still not very happy with the whole approach, > > > giving userspace ability allocate 4 whole pages > > > of kernel memory like this. > > I'm working in parallel so that userspace won't take so > > many slots but it won't prevent its current versions > > crashing due to kernel limitation. > > Right but at least it's not a regression. If we promise userspace to > support a ton of regions, we can't take it back later, and I'm concerned > about the memory usage. > > I think it's already safe to merge the binary lookup patches, and maybe > cache and vmalloc, so that the remaining patch will be small. it isn't regression with switching to binary search and increasing slots to 509 either performance wise it's more on improvment side. 
And I was thinking about memory usage as well, that's why I've dropped faster radix tree in favor of more compact array, can't do better on kernel side of fix. Yes we will give userspace to ability to use more slots/and lock up more memory if it's not able to consolidate memory regions but that leaves an option for user to run guest with vhost performance vs crashing it at runtime. userspace/targets that could consolidate memory regions should do so and I'm working on that as well but that doesn't mean that users shouldn't have a choice. So far it's kernel limitation and this patch fixes crashes that users see now, with the rest of patches enabling performance not to regress. > > > > > > > > > > I think if we are changing this, it'd be nice to > > > > > > > create a way for userspace to discover the support > > > > > > > and the # of regions supported. > > > > > > That was my first idea before extending KVM's memslots > > > > > > to teach kernel to tell qemu this number so that QEMU > > > > > > at least would be able to check if new memory slot could > > > > > > be added but I was redirected to a more simple solution > > > > > > of just extending vs everdoing things. > > > > > > Currently QEMU supports upto ~250 memslots so 509 > > > > > > is about twice high we need it so it should work for near > > > > > > future > > > > > > > > > > Yes but old kernels are still around. Would be nice if you > > > > > can detect them. > > > > > > > > > > > but eventually we might still teach kernel and QEMU > > > > > > to make things more robust. > > > > > > > > > > A new ioctl would be easy to add, I think it's a good > > > > > idea generally. > > > > I can try to do something like this on top of this series. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > drivers/vhost/vhost.c | 2 +- > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > > index 99931a0..6a18c92 100644 > > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > > @@ -30,7 +30,7 @@ > > > > > > > > #include "vhost.h" > > > > > > > > > > > > > > > > enum { > > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > > > > }; > > > > > > > > > > > > > > > > -- > > > > > > > > 1.8.3.1 > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
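For readers who have not seen the rest of the series, the binary search Igor refers to above boils down to a lookup over an array of regions sorted by guest physical address. The following is a simplified stand-in written for this note, not the code from the series; it only assumes the vhost_memory_region layout from <linux/vhost.h> and non-overlapping, sorted regions.

/* Simplified sketch: find the region covering a guest physical address
 * in an array sorted by guest_phys_addr.  Not the actual patch code.
 */
#include <stddef.h>
#include <linux/types.h>
#include <linux/vhost.h>	/* struct vhost_memory_region */

static const struct vhost_memory_region *
find_region(const struct vhost_memory_region *regs, unsigned int n, __u64 gpa)
{
	unsigned int lo = 0, hi = n;

	while (lo < hi) {
		unsigned int mid = lo + (hi - lo) / 2;
		const struct vhost_memory_region *r = &regs[mid];

		if (gpa < r->guest_phys_addr)
			hi = mid;
		else if (gpa >= r->guest_phys_addr + r->memory_size)
			lo = mid + 1;
		else
			return r;	/* gpa falls inside this region */
	}
	return NULL;	/* address not covered by any slot */
}

Against a flat array this is O(log n) per translation, which is why raising the slot limit need not hurt the per-packet lookup cost the way a linear scan would.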
On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 12:11:09 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote: > > > On Wed, 17 Jun 2015 09:39:06 +0200 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > > > > > On Wed, 17 Jun 2015 08:34:26 +0200 > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > > > > > since commit > > > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > > > > > slots, which is used by memory hotplug for > > > > > > > > > registering hotplugged memory. > > > > > > > > > However QEMU crashes if it's used with more than ~60 > > > > > > > > > pc-dimm devices and vhost-net since host kernel > > > > > > > > > in module vhost-net refuses to accept more than 65 > > > > > > > > > memory regions. > > > > > > > > > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > > > > > > > > > It was 64, not 65. > > > > > > > > > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > > > > > > > > > Still thinking about this: can you reorder this to > > > > > > > > be the last patch in the series please? > > > > > > > sure > > > > > > > > > > > > > > > > > > > > > > > Also - 509? > > > > > > > userspace memory slots in terms of KVM, I made it match > > > > > > > KVM's allotment of memory slots for userspace side. > > > > > > > > > > > > Maybe KVM has its reasons for this #. I don't see > > > > > > why we need to match this exactly. > > > > > np, I can cap it at safe 300 slots but it's unlikely that it > > > > > would take cut off 1 extra hop since it's capped by QEMU > > > > > at 256+[initial fragmented memory] > > > > > > > > But what's the point? We allocate 32 bytes per slot. > > > > 300*32 = 9600 which is more than 8K, so we are doing > > > > an order-3 allocation anyway. > > > > If we could cap it at 8K (256 slots) that would make sense > > > > since we could avoid wasting vmalloc space. > > > 256 is amount of hotpluggable slots and there is no way > > > to predict how initial memory would be fragmented > > > (i.e. amount of slots it would take), if we guess wrong > > > we are back to square one with crashing userspace. > > > So I'd stay consistent with KVM's limit 509 since > > > it's only limit, i.e. not actual amount of allocated slots. > > > > > > > I'm still not very happy with the whole approach, > > > > giving userspace ability allocate 4 whole pages > > > > of kernel memory like this. > > > I'm working in parallel so that userspace won't take so > > > many slots but it won't prevent its current versions > > > crashing due to kernel limitation. > > > > Right but at least it's not a regression. If we promise userspace to > > support a ton of regions, we can't take it back later, and I'm concerned > > about the memory usage. 
> > > > I think it's already safe to merge the binary lookup patches, and maybe > > cache and vmalloc, so that the remaining patch will be small. > it isn't regression with switching to binary search and increasing > slots to 509 either performance wise it's more on improvment side. > And I was thinking about memory usage as well, that's why I've dropped > faster radix tree in favor of more compact array, can't do better > on kernel side of fix. > > Yes we will give userspace to ability to use more slots/and lock up > more memory if it's not able to consolidate memory regions but > that leaves an option for user to run guest with vhost performance > vs crashing it at runtime. Crashing is entirely QEMU's own doing in not handling the error gracefully. > > userspace/targets that could consolidate memory regions should > do so and I'm working on that as well but that doesn't mean > that users shouldn't have a choice. It's a fairly unusual corner case, I'm not yet convinced we need to quickly add support to it when just waiting a bit longer will get us an equivalent (or even more efficient) fix in userspace. > So far it's kernel limitation and this patch fixes crashes > that users see now, with the rest of patches enabling performance > not to regress. When I say regression I refer to an option to limit the array size again after userspace started using the larger size. > > > > > > > > > > > > > I think if we are changing this, it'd be nice to > > > > > > > > create a way for userspace to discover the support > > > > > > > > and the # of regions supported. > > > > > > > That was my first idea before extending KVM's memslots > > > > > > > to teach kernel to tell qemu this number so that QEMU > > > > > > > at least would be able to check if new memory slot could > > > > > > > be added but I was redirected to a more simple solution > > > > > > > of just extending vs everdoing things. > > > > > > > Currently QEMU supports upto ~250 memslots so 509 > > > > > > > is about twice high we need it so it should work for near > > > > > > > future > > > > > > > > > > > > Yes but old kernels are still around. Would be nice if you > > > > > > can detect them. > > > > > > > > > > > > > but eventually we might still teach kernel and QEMU > > > > > > > to make things more robust. > > > > > > > > > > > > A new ioctl would be easy to add, I think it's a good > > > > > > idea generally. > > > > > I can try to do something like this on top of this series. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > drivers/vhost/vhost.c | 2 +- > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > > > index 99931a0..6a18c92 100644 > > > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > > > @@ -30,7 +30,7 @@ > > > > > > > > > #include "vhost.h" > > > > > > > > > > > > > > > > > > enum { > > > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > > > > > }; > > > > > > > > > > > > > > > > > > -- > > > > > > > > > 1.8.3.1 > > -- > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Jun 2015 12:46:09 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote: > > On Wed, 17 Jun 2015 12:11:09 +0200 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote: > > > > On Wed, 17 Jun 2015 09:39:06 +0200 > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote: > > > > > > On Wed, 17 Jun 2015 08:34:26 +0200 > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote: > > > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200 > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > > > > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote: > > > > > > > > > > since commit > > > > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509 > > > > > > > > > > > > > > > > > > > > it became possible to use a bigger amount of memory > > > > > > > > > > slots, which is used by memory hotplug for > > > > > > > > > > registering hotplugged memory. > > > > > > > > > > However QEMU crashes if it's used with more than ~60 > > > > > > > > > > pc-dimm devices and vhost-net since host kernel > > > > > > > > > > in module vhost-net refuses to accept more than 65 > > > > > > > > > > memory regions. > > > > > > > > > > > > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509 > > > > > > > > > > > > > > > > > > It was 64, not 65. > > > > > > > > > > > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com> > > > > > > > > > > > > > > > > > > Still thinking about this: can you reorder this to > > > > > > > > > be the last patch in the series please? > > > > > > > > sure > > > > > > > > > > > > > > > > > > > > > > > > > > Also - 509? > > > > > > > > userspace memory slots in terms of KVM, I made it match > > > > > > > > KVM's allotment of memory slots for userspace side. > > > > > > > > > > > > > > Maybe KVM has its reasons for this #. I don't see > > > > > > > why we need to match this exactly. > > > > > > np, I can cap it at safe 300 slots but it's unlikely that it > > > > > > would take cut off 1 extra hop since it's capped by QEMU > > > > > > at 256+[initial fragmented memory] > > > > > > > > > > But what's the point? We allocate 32 bytes per slot. > > > > > 300*32 = 9600 which is more than 8K, so we are doing > > > > > an order-3 allocation anyway. > > > > > If we could cap it at 8K (256 slots) that would make sense > > > > > since we could avoid wasting vmalloc space. > > > > 256 is amount of hotpluggable slots and there is no way > > > > to predict how initial memory would be fragmented > > > > (i.e. amount of slots it would take), if we guess wrong > > > > we are back to square one with crashing userspace. > > > > So I'd stay consistent with KVM's limit 509 since > > > > it's only limit, i.e. not actual amount of allocated slots. > > > > > > > > > I'm still not very happy with the whole approach, > > > > > giving userspace ability allocate 4 whole pages > > > > > of kernel memory like this. > > > > I'm working in parallel so that userspace won't take so > > > > many slots but it won't prevent its current versions > > > > crashing due to kernel limitation. > > > > > > Right but at least it's not a regression. 
If we promise userspace to > > > support a ton of regions, we can't take it back later, and I'm concerned > > > about the memory usage. > > > > > > I think it's already safe to merge the binary lookup patches, and maybe > > > cache and vmalloc, so that the remaining patch will be small. > > it isn't regression with switching to binary search and increasing > > slots to 509 either performance wise it's more on improvment side. > > And I was thinking about memory usage as well, that's why I've dropped > > faster radix tree in favor of more compact array, can't do better > > on kernel side of fix. > > > > Yes we will give userspace to ability to use more slots/and lock up > > more memory if it's not able to consolidate memory regions but > > that leaves an option for user to run guest with vhost performance > > vs crashing it at runtime. > > Crashing is entirely QEMU's own doing in not handling > the error gracefully. and that's hard to fix (handle error gracefully) the way it's implemented now. > > > > userspace/targets that could consolidate memory regions should > > do so and I'm working on that as well but that doesn't mean > > that users shouldn't have a choice. > > It's a fairly unusual corner case, I'm not yet > convinced we need to quickly add support to it when just waiting a bit > longer will get us an equivalent (or even more efficient) fix in > userspace. with memory hotplug support in QEMU has been released for quite a long time already and there is users that use it so fix in the future QEMU won't make it work with their distros. So I wouldn't say that is "fairly unusual corner case". > > > So far it's kernel limitation and this patch fixes crashes > > that users see now, with the rest of patches enabling performance > > not to regress. > > When I say regression I refer to an option to limit the array > size again after userspace started using the larger size. Is there a need to do so? Userspace that cares about memory footprint won't use many slots keeping it low and user space that can't do without many slots or doesn't care will have bigger memory footprint. > > > > > > > > > > > > > > > > > > I think if we are changing this, it'd be nice to > > > > > > > > > create a way for userspace to discover the support > > > > > > > > > and the # of regions supported. > > > > > > > > That was my first idea before extending KVM's memslots > > > > > > > > to teach kernel to tell qemu this number so that QEMU > > > > > > > > at least would be able to check if new memory slot could > > > > > > > > be added but I was redirected to a more simple solution > > > > > > > > of just extending vs everdoing things. > > > > > > > > Currently QEMU supports upto ~250 memslots so 509 > > > > > > > > is about twice high we need it so it should work for near > > > > > > > > future > > > > > > > > > > > > > > Yes but old kernels are still around. Would be nice if you > > > > > > > can detect them. > > > > > > > > > > > > > > > but eventually we might still teach kernel and QEMU > > > > > > > > to make things more robust. > > > > > > > > > > > > > > A new ioctl would be easy to add, I think it's a good > > > > > > > idea generally. > > > > > > I can try to do something like this on top of this series. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > drivers/vhost/vhost.c | 2 +- > > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > > > > > > > > index 99931a0..6a18c92 100644 > > > > > > > > > > --- a/drivers/vhost/vhost.c > > > > > > > > > > +++ b/drivers/vhost/vhost.c > > > > > > > > > > @@ -30,7 +30,7 @@ > > > > > > > > > > #include "vhost.h" > > > > > > > > > > > > > > > > > > > > enum { > > > > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64, > > > > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509, > > > > > > > > > > VHOST_MEMORY_F_LOG = 0x1, > > > > > > > > > > }; > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > 1.8.3.1 > > > -- > > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > So far it's kernel limitation and this patch fixes crashes
> > > that users see now, with the rest of patches enabling performance
> > > not to regress.
> >
> > When I say regression I refer to an option to limit the array
> > size again after userspace started using the larger size.
> Is there a need to do so?

Considering userspace can be malicious, I guess yes.

> Userspace that cares about memory footprint won't use many slots
> keeping it low and user space that can't do without many slots
> or doesn't care will have bigger memory footprint.

We really can't trust userspace to do the right thing though.
On Wed, 17 Jun 2015 13:51:56 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote: > > > > So far it's kernel limitation and this patch fixes crashes > > > > that users see now, with the rest of patches enabling performance > > > > not to regress. > > > > > > When I say regression I refer to an option to limit the array > > > size again after userspace started using the larger size. > > Is there a need to do so? > > Considering userspace can be malicious, I guess yes. I don't think it's a valid concern in this case, setting limit back from 509 to 64 will not help here in any way, userspace still can create as many vhost instances as it needs to consume memory it desires. > > > Userspace that cares about memory footprint won't use many slots > > keeping it low and user space that can't do without many slots > > or doesn't care will have bigger memory footprint. > > We really can't trust userspace to do the right thing though. > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 02:23:39PM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 13:51:56 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote: > > > > > So far it's kernel limitation and this patch fixes crashes > > > > > that users see now, with the rest of patches enabling performance > > > > > not to regress. > > > > > > > > When I say regression I refer to an option to limit the array > > > > size again after userspace started using the larger size. > > > Is there a need to do so? > > > > Considering userspace can be malicious, I guess yes. > I don't think it's a valid concern in this case, > setting limit back from 509 to 64 will not help here in any way, > userspace still can create as many vhost instances as it needs > to consume memory it desires. Not really since vhost char device isn't world-accessible. It's typically opened by a priveledged tool, the fd is then passed to an unpriveledged userspace, or permissions dropped. > > > > > Userspace that cares about memory footprint won't use many slots > > > keeping it low and user space that can't do without many slots > > > or doesn't care will have bigger memory footprint. > > > > We really can't trust userspace to do the right thing though. > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > Considering userspace can be malicious, I guess yes.
> > I don't think it's a valid concern in this case,
> > setting limit back from 509 to 64 will not help here in any way,
> > userspace still can create as many vhost instances as it needs
> > to consume memory it desires.
>
> Not really since vhost char device isn't world-accessible.
> It's typically opened by a privileged tool, the fd is
> then passed to an unprivileged userspace, or permissions dropped.

Then what's the concern anyway?

Paolo
On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > Considering userspace can be malicious, I guess yes.
> > > I don't think it's a valid concern in this case,
> > > setting limit back from 509 to 64 will not help here in any way,
> > > userspace still can create as many vhost instances as it needs
> > > to consume memory it desires.
> >
> > Not really since vhost char device isn't world-accessible.
> > It's typically opened by a privileged tool, the fd is
> > then passed to an unprivileged userspace, or permissions dropped.
>
> Then what's the concern anyway?
>
> Paolo

Each fd now ties up 16K of kernel memory. It didn't use to, so a
privileged tool could safely give the unprivileged userspace a ton of
these fds.
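The 16K figure can be reproduced from the 32-bytes-per-slot number quoted earlier in the thread; a quick userspace check, included here purely for illustration:

/* Reproduces the sizes discussed above from the UAPI region layout. */
#include <stdio.h>
#include <linux/vhost.h>

int main(void)
{
	size_t per_region = sizeof(struct vhost_memory_region);	/* 32 bytes */

	printf("64 regions:  %zu bytes\n", 64 * per_region);	/* ~2K, old limit */
	printf("509 regions: %zu bytes\n", 509 * per_region);	/* ~16K, new limit */
	return 0;
}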
On Wed, 17 Jun 2015 16:32:02 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote: > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote: > > > > > Considering userspace can be malicious, I guess yes. > > > > I don't think it's a valid concern in this case, > > > > setting limit back from 509 to 64 will not help here in any way, > > > > userspace still can create as many vhost instances as it needs > > > > to consume memory it desires. > > > > > > Not really since vhost char device isn't world-accessible. > > > It's typically opened by a priveledged tool, the fd is > > > then passed to an unpriveledged userspace, or permissions dropped. > > > > Then what's the concern anyway? > > > > Paolo > > Each fd now ties up 16K of kernel memory. It didn't use to, so > priveledged tool could safely give the unpriveledged userspace > a ton of these fds. if privileged tool gives out unlimited amount of fds then it doesn't matter whether fd ties 4K or 16K, host still could be DoSed. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 16:32:02 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote: > > > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote: > > > > > > Considering userspace can be malicious, I guess yes. > > > > > I don't think it's a valid concern in this case, > > > > > setting limit back from 509 to 64 will not help here in any way, > > > > > userspace still can create as many vhost instances as it needs > > > > > to consume memory it desires. > > > > > > > > Not really since vhost char device isn't world-accessible. > > > > It's typically opened by a priveledged tool, the fd is > > > > then passed to an unpriveledged userspace, or permissions dropped. > > > > > > Then what's the concern anyway? > > > > > > Paolo > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so > > priveledged tool could safely give the unpriveledged userspace > > a ton of these fds. > if privileged tool gives out unlimited amount of fds then it > doesn't matter whether fd ties 4K or 16K, host still could be DoSed. > Of course it does not give out unlimited fds, there's a way for the sysadmin to specify the number of fds. Look at how libvirt uses vhost, it should become clear I think.
On Wed, 17 Jun 2015 17:38:40 +0200 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote: > > On Wed, 17 Jun 2015 16:32:02 +0200 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote: > > > > > > > Considering userspace can be malicious, I guess yes. > > > > > > I don't think it's a valid concern in this case, > > > > > > setting limit back from 509 to 64 will not help here in any > > > > > > way, userspace still can create as many vhost instances as > > > > > > it needs to consume memory it desires. > > > > > > > > > > Not really since vhost char device isn't world-accessible. > > > > > It's typically opened by a priveledged tool, the fd is > > > > > then passed to an unpriveledged userspace, or permissions > > > > > dropped. > > > > > > > > Then what's the concern anyway? > > > > > > > > Paolo > > > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so > > > priveledged tool could safely give the unpriveledged userspace > > > a ton of these fds. > > if privileged tool gives out unlimited amount of fds then it > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed. > > > > Of course it does not give out unlimited fds, there's a way > for the sysadmin to specify the number of fds. Look at how libvirt > uses vhost, it should become clear I think. then it just means that tool has to take into account a new limits to partition host in sensible manner. Exposing limit as module parameter might be of help to tool for getting/setting it in a way it needs. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote: > On Wed, 17 Jun 2015 17:38:40 +0200 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote: > > > On Wed, 17 Jun 2015 16:32:02 +0200 > > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote: > > > > > > > > Considering userspace can be malicious, I guess yes. > > > > > > > I don't think it's a valid concern in this case, > > > > > > > setting limit back from 509 to 64 will not help here in any > > > > > > > way, userspace still can create as many vhost instances as > > > > > > > it needs to consume memory it desires. > > > > > > > > > > > > Not really since vhost char device isn't world-accessible. > > > > > > It's typically opened by a priveledged tool, the fd is > > > > > > then passed to an unpriveledged userspace, or permissions > > > > > > dropped. > > > > > > > > > > Then what's the concern anyway? > > > > > > > > > > Paolo > > > > > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so > > > > priveledged tool could safely give the unpriveledged userspace > > > > a ton of these fds. > > > if privileged tool gives out unlimited amount of fds then it > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed. > > > > > > > Of course it does not give out unlimited fds, there's a way > > for the sysadmin to specify the number of fds. Look at how libvirt > > uses vhost, it should become clear I think. > then it just means that tool has to take into account a new limits > to partition host in sensible manner. Meanwhile old tools are vulnerable to OOM attacks. > Exposing limit as module parameter might be of help to tool for > getting/setting it in a way it needs.
On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> Meanwhile old tools are vulnerable to OOM attacks.
For each vhost device there will likely be one tap interface, and I
suspect that it takes way, way more than 16KB of memory.
Paolo
On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> > Meanwhile old tools are vulnerable to OOM attacks.
>
> For each vhost device there will likely be one tap interface, and I
> suspect that it takes way, way more than 16KB of memory.
>
> Paolo

That's not true. We have a vhost device per queue; all queues
are part of a single tap device.
On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
>>> Meanwhile old tools are vulnerable to OOM attacks.
>>
>> For each vhost device there will likely be one tap interface, and I
>> suspect that it takes way, way more than 16KB of memory.
>
> That's not true. We have a vhost device per queue; all queues
> are part of a single tap device.

s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.

Paolo
On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> > That's not true. We have a vhost device per queue, all queues
> > are part of a single tap device.
>
> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
>
> Paolo

That's up to you as a kvm maintainer :)

People are already concerned about vhost device memory usage, I'm not
happy to define our user/kernel interface in a way that forces even
more memory to be used up.
On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
>> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
>
> That's up to you as a kvm maintainer :)

Not easy, when the CPU alone requires three (albeit non-consecutive)
pages for the VMCS, the APIC access page and the EPT root.

> People are already concerned about vhost device
> memory usage, I'm not happy to define our user/kernel interface
> in a way that forces even more memory to be used up.

So, the questions to ask are:

1) What is the memory usage like immediately after vhost is brought up,
apart from these 16K?

2) Is there anything in vhost that allocates a user-controllable amount
of memory?

3) What is the size of the data structures that support one virtqueue
(there are two of them)?  Does it depend on the size of the virtqueues?

4) Would it make sense to share memory regions between multiple vhost
devices?  Would it be hard to implement?  It would also make memory
operations O(1) rather than O(#cpus).

Paolo
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > then it just means that the tool has to take the new limit into account
> > when partitioning the host in a sensible manner.
>
> Meanwhile old tools are vulnerable to OOM attacks.

Let's keep the old limit by default and allow overriding it via a module
parameter; that way old tools won't be affected and new tools can set the
limit the way they need. That will accommodate the current slot-hungry
userspace as well as a new one with a contiguous HVA, and won't regress
old tools.

> > Exposing the limit as a module parameter might help the tool get/set it
> > in the way it needs.
On Wed, 17 Jun 2015 18:47:18 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> So, the questions to ask are:
>
> 1) What is the memory usage like immediately after vhost is brought up,
> apart from these 16K?
>
> 2) Is there anything in vhost that allocates a user-controllable amount
> of memory?
>
> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?  Does it depend on the size of the virtqueues?
>
> 4) Would it make sense to share memory regions between multiple vhost
> devices?  Would it be hard to implement?  It would also make memory
> operations O(1) rather than O(#cpus).
>
> Paolo

In addition to that, could vhost share the memory map with KVM, i.e. use
its memslots instead of duplicating them?
On Wed, Jun 17, 2015 at 06:47:18PM +0200, Paolo Bonzini wrote:
> So, the questions to ask are:
>
> 1) What is the memory usage like immediately after vhost is brought up,
> apart from these 16K?

About 24K, but most of that is iov pool arrays that are kept around as an
optimization to avoid kmalloc on the data path.  Below 1K tracks persistent
state.  Recently people have been complaining about these pools, so I've
been thinking about switching to a per-cpu array, or something similar.

> 2) Is there anything in vhost that allocates a user-controllable amount
> of memory?

Definitely not in vhost-net.

> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?

Around 256 bytes.

> Does it depend on the size of the virtqueues?

No.

> 4) Would it make sense to share memory regions between multiple vhost
> devices?  Would it be hard to implement?

It's not trivial.  It would absolutely require userspace ABI extensions.

> It would also make memory
> operations O(1) rather than O(#cpus).
>
> Paolo

We'd save the kmalloc/memcpy/kfree, that is true.  But we'd still need to
flush all VQs, so it's still O(#cpus); we'd just be doing less stuff in
that O(#cpus).
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > then it just means that the tool has to take the new limit into account
> > when partitioning the host in a sensible manner.
>
> Meanwhile old tools are vulnerable to OOM attacks.

I've chatted with the libvirt folks; libvirt doesn't care about how much
memory vhost would consume, nor does it do any host capacity planning in
that regard.

But let's assume that there are tools that do this.  So how about, instead
of hardcoding the limit, making it a module parameter with the default set
to 64?  That would allow users to set a higher limit if they need it and
not regress old tools.  It will also give tools an interface for reading
the limit from the vhost module.
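(For concreteness, a minimal sketch of what such a knob in
drivers/vhost/vhost.c could look like; the parameter name, type and
permissions below are illustrative assumptions, not something settled in
this thread.)

/*
 * Sketch only: replace the hard-coded constant with a module parameter.
 * "max_mem_regions" and the 0444 (read-only via sysfs) mode are assumed
 * names/choices for illustration.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>

static ushort max_mem_regions = 64;
module_param(max_mem_regions, ushort, 0444);
MODULE_PARM_DESC(max_mem_regions,
	"Maximum number of memory regions in memory map. (default: 64)");

/* ...and vhost_set_memory() would then check the parameter instead of
 * the VHOST_MEMORY_MAX_NREGIONS constant:
 *
 *	if (mem.nregions > max_mem_regions)
 *		return -E2BIG;
 */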
On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 18:30:02 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Meanwhile old tools are vulnerable to OOM attacks.
> I've chatted with the libvirt folks; libvirt doesn't care about how much
> memory vhost would consume, nor does it do any host capacity planning in
> that regard.

Exactly, it's up to the host admin.

> But let's assume that there are tools that do this.  So how about, instead
> of hardcoding the limit, making it a module parameter with the default set
> to 64?  That would allow users to set a higher limit if they need it and
> not regress old tools.  It will also give tools an interface for reading
> the limit from the vhost module.

And now you need to choose between security and functionality :(
On 18/06/2015 11:50, Michael S. Tsirkin wrote:
>> But let's assume that there are tools that do this.  So how about, instead
>> of hardcoding the limit, making it a module parameter with the default set
>> to 64?
>
> And now you need to choose between security and functionality :(

Please don't call a 16K allocation that can fall back to vmalloc
"security".  That's an insult to actual security problems...

Paolo
On Thu, 18 Jun 2015 11:50:22 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> > But let's assume that there are tools that do this.  So how about, instead
> > of hardcoding the limit, making it a module parameter with the default set
> > to 64?  That would allow users to set a higher limit if they need it and
> > not regress old tools.  It will also give tools an interface for reading
> > the limit from the vhost module.
>
> And now you need to choose between security and functionality :(

There is no conflict here, and it's not about choosing.  If the admin has
a method to estimate the guest memory footprint in order to do capacity
partitioning, then he would need to redo the partitioning, taking the new
footprint into account, when he/she raises the limit manually.

(BTW, libvirt has tried and reverted patches that were trying to predict
the required memory; an admin might be able to do it better manually, but
how to do that is another topic and not related to this thread.)

Let's leave the decision up to users instead of making them live with
crashing guests.
On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> Let's leave the decision up to users instead of making them live with
> crashing guests.

Come on, let's fix it in userspace.
On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
>> Let's leave the decision up to users instead of making them live with
>> crashing guests.
>
> Come on, let's fix it in userspace.

It's not trivial to fix it in userspace.  Since QEMU uses RCU, there
isn't a single memory map to use for a linear gpa->hva map.

I find it absurd that we're fighting over 12K of memory.

Paolo
On Thu, 18 Jun 2015 13:41:22 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > Let's leave the decision up to users instead of making them live with
> > crashing guests.
>
> Come on, let's fix it in userspace.

I'm not abandoning the userspace approach either, but it might take time
to implement in a robust manner, as it's much more complex and has many
more places to backfire than a straightforward kernel fix, which will
work for both old userspace and new.
On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > Come on, let's fix it in userspace.
>
> It's not trivial to fix it in userspace.  Since QEMU uses RCU, there
> isn't a single memory map to use for a linear gpa->hva map.

Could you elaborate?

I'm confused by this mention of RCU.  You use RCU for accesses to the
memory map, correct?  So changing the memory map itself is a write-side
operation; as such, all you need to do is take some kind of lock to
prevent conflicting with other memory-map updates, and do an RCU sync
under this lock.

> I find it absurd that we're fighting over 12K of memory.
>
> Paolo

I wouldn't worry so much if it didn't affect the kernel/userspace API.
We need to be careful there.
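(As an illustration of the write-side pattern described above: a hedged
sketch using pthreads and the userspace-RCU library for concreteness.
QEMU's own RCU primitives have the same shape; the map type and helper
names below are invented for the example.)

/*
 * Sketch only, not QEMU code: serialize memory-map updates and wait for
 * concurrent lock-free readers before freeing the old map.
 * Build with -lurcu.
 */
#include <pthread.h>
#include <stdlib.h>
#include <urcu.h>

struct mem_map { int nregions; /* hypothetical flattened gpa->hva map */ };

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mem_map *current_map;

static void replace_memory_map(struct mem_map *new_map)
{
    struct mem_map *old_map;

    pthread_mutex_lock(&map_lock);             /* writers exclude each other */
    old_map = current_map;
    rcu_assign_pointer(current_map, new_map);  /* publish the new map        */
    synchronize_rcu();                         /* wait for readers of old_map */
    free(old_map);                             /* nobody can reference it now */
    pthread_mutex_unlock(&map_lock);
}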
On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> I'm confused by this mention of RCU.  You use RCU for accesses to the
> memory map, correct?  So changing the memory map itself is a write-side
> operation; as such, all you need to do is take some kind of lock to
> prevent conflicting with other memory-map updates, and do an RCU sync
> under this lock.

You're right, the problem isn't directly related to RCU.  RCU would be
easy to handle by using synchronize_rcu instead of call_rcu.  While I
identified an RCU-related problem with Igor's patches, it's much more
entrenched.

RAM can be used by asynchronous operations while the VM runs, between
address_space_map and address_space_unmap.  It is possible and common to
have a quiescent state between the map and unmap, and a memory map change
can happen in the middle of this.  Normally this is not a problem, because
changes to the memory map do not make the hva go away (memory regions are
reference counted).

However, with Igor's patches a memory_region_del_subregion will cause an
mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go
away.

I guess one way to do it would be to alias the same page in two places,
one for use by vhost and one for use by everything else.  However, the
kernel does not provide the means to do this kind of aliasing for
anonymous mmaps.

Paolo
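(For readers unfamiliar with the operation under discussion: a minimal,
generic sketch of "punching a hole" in an anonymous mapping.  This is plain
POSIX/Linux mmap usage, not code taken from Igor's series.)

/*
 * Sketch only: drop the backing pages of [addr, addr+len) while keeping
 * the virtual range mapped.  MAP_FIXED atomically replaces the old pages,
 * so the previous hva->page association is gone afterwards.
 */
#include <sys/mman.h>
#include <stddef.h>

static int discard_range(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                   -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}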
On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
> RAM can be used by asynchronous operations while the VM runs, between
> address_space_map and address_space_unmap.  It is possible and common to
> have a quiescent state between the map and unmap, and a memory map change
> can happen in the middle of this.  Normally this is not a problem, because
> changes to the memory map do not make the hva go away (memory regions are
> reference counted).

Right, so you want mmap(MAP_NORESERVE) when that reference count becomes 0.

> However, with Igor's patches a memory_region_del_subregion will cause an
> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go
> away.
>
> I guess one way to do it would be to alias the same page in two places,
> one for use by vhost and one for use by everything else.  However, the
> kernel does not provide the means to do this kind of aliasing for
> anonymous mmaps.
>
> Paolo

Basically pages go away on munmap, so won't a simple

	lock
	munmap
	mmap(MAP_NORESERVE)
	unlock

do the trick?
On Thu, 18 Jun 2015 16:47:33 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Basically pages go away on munmap, so won't a simple
>
>	lock
>	munmap
>	mmap(MAP_NORESERVE)
>	unlock
>
> do the trick?

At what time are you suggesting to do this?
On 18/06/2015 16:47, Michael S. Tsirkin wrote:
>> However, with Igor's patches a memory_region_del_subregion will cause an
>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go
>> away.
>
> Basically pages go away on munmap, so won't a simple
>	lock
>	munmap
>	mmap(MAP_NORESERVE)
>	unlock
> do the trick?

Not sure I follow.  Here we have this:

    VCPU 1                             VCPU 2                        I/O worker
    ----------------------------------------------------------------------------------
    take big QEMU lock
    p = address_space_map(hva, len)
    pass I/O request to worker thread
                                                                     read(fd, p, len)
    release big QEMU lock
                                       memory_region_del_subregion
                                       mmap(MAP_NORESERVE)
                                                                     read returns EFAULT
                                                                     wake up VCPU 1
    take big QEMU lock
    EFAULT?  What's that?

In another scenario you are less lucky: the memory accesses between
address_space_map/unmap aren't done in the kernel and you get a plain
old SIGSEGV.

This is not something that you can fix with a lock.  The very purpose of
the map/unmap API is to do stuff asynchronously while the lock is
released.

Thanks,

Paolo
On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
> Not sure I follow.  Here we have this:
> [...]
>                                        memory_region_del_subregion
>                                        mmap(MAP_NORESERVE)
>                                                                      read returns EFAULT

Why doesn't it EFAULT without mmap(MAP_NORESERVE)?  Doesn't
memory_region_del_subregion free the memory?

> [...]
> This is not something that you can fix with a lock.  The very purpose of
> the map/unmap API is to do stuff asynchronously while the lock is
> released.
On 19/06/2015 09:56, Michael S. Tsirkin wrote:
> Why doesn't it EFAULT without mmap(MAP_NORESERVE)?  Doesn't
> memory_region_del_subregion free the memory?

No, only destruction of the memory region frees it.  address_space_map
takes a reference to the memory region and address_space_unmap releases
it.

Paolo
On Fri, Jun 19, 2015 at 09:57:22AM +0200, Paolo Bonzini wrote:
> No, only destruction of the memory region frees it.  address_space_map
> takes a reference to the memory region and address_space_unmap releases
> it.
>
> Paolo

Confused.  So can we call mmap(MAP_NORESERVE) in address_space_unmap
after we detect refcount is 0?
On 19/06/2015 10:05, Michael S. Tsirkin wrote:
>> No, only destruction of the memory region frees it.  address_space_map
>> takes a reference to the memory region and address_space_unmap releases
>> it.
>
> Confused.  So can we call mmap(MAP_NORESERVE) in address_space_unmap
> after we detect refcount is 0?

No, because in the meanwhile another DIMM could have been hotplugged
at the same place where the old one was.  This is legal:

    user                     guest                   QEMU
    ----------------------------------------------------------------------------------
                             start I/O
                                 '--------------->  address_space_map
    device_del
        '------------------> receives SCI
                             executes _EJ0
                                 '--------------->  memory_region_del_subregion
                                                    object_unparent
    device_add
        '-------------------------------------->    device_set_realized
                                                      hotplug_handler_plug
                                                        pc_machine_device_plug_cb
                                                          pc_dimm_plug
                                                            memory_region_add_subregion

                             I/O finishes
                                 '--------------->  address_space_unmap

Surprise removal could similarly be done in QEMU, but it will hold on to
some resources for as long as the device backends need them.

Paolo
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was.  This is legal:
> [...]

So guest started DMA into memory, then ejected this memory while DMA
is in progress?
On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> So guest started DMA into memory, then ejected this memory while DMA
> is in progress?

Yes.  There is nothing that forbids doing that.

Paolo
On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
> > So guest started DMA into memory, then ejected this memory while DMA
> > is in progress?
>
> Yes.  There is nothing that forbids doing that.

Can we simply defer the next device_add using a hva until all I/O
completes?
On 19/06/2015 15:34, Michael S. Tsirkin wrote:
> Can we simply defer the next device_add using a hva until all I/O
> completes?

We could, but I/O is just an example.  It can be I/O, a network ring,
whatever.  We cannot audit all address_space_map uses.

Paolo
On Fri, Jun 19, 2015 at 05:19:44PM +0200, Paolo Bonzini wrote:
> We could, but I/O is just an example.  It can be I/O, a network ring,
> whatever.  We cannot audit all address_space_map uses.
>
> Paolo

No need to audit them all: defer a device_add that reuses an hva range
until address_space_unmap has dropped the reference count on that range
to 0.
On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> No need to audit them all: defer a device_add that reuses an hva range
> until address_space_unmap has dropped the reference count on that range
> to 0.

That could be forever.  You certainly don't want to lock up the monitor
forever just because a device model isn't too friendly to memory
hot-unplug.

That's why you need to audit them (also, it's perfectly within the device
model's rights to use address_space_unmap this way: it's the guest that's
buggy and leaves a dangling reference to a region before unplugging it).

Paolo
On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> That could be forever.  You certainly don't want to lock up the monitor
> forever just because a device model isn't too friendly to memory
> hot-unplug.

We can defer the addition; there is no need to lock up the monitor.

> That's why you need to audit them (also, it's perfectly within the device
> model's rights to use address_space_unmap this way: it's the guest that's
> buggy and leaves a dangling reference to a region before unplugging it).
>
> Paolo

Then maybe it's not too bad that the guest will crash because the memory
was unmapped.
On 19/06/2015 18:33, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
>> That could be forever.  You certainly don't want to lock up the monitor
>> forever just because a device model isn't too friendly to memory
>> hot-unplug.
>
> We can defer the addition; there is no need to lock up the monitor.

Patches are welcome.

>> That's why you need to audit them (also, it's perfectly within the device
>> model's rights to use address_space_unmap this way: it's the guest that's
>> buggy and leaves a dangling reference to a region before unplugging it).
>
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.

That's a matter of taste.  I strongly prefer using 12K of extra memory
per VCPU to a guest crash.

Paolo
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was.  This is legal:
> [...]
> Surprise removal could similarly be done in QEMU, but it will hold on to
> some resources for as long as the device backends need them.
>
> Paolo

OK, so what's the problem with checking for this condition after
address_space_unmap detects that the ref count is 0, and before calling
mmap(MAP_NORESERVE)?  At that point we are off the data path, so we can
take locks.
On 19/06/2015 18:45, Michael S. Tsirkin wrote:
> OK, so what's the problem with checking for this condition after
> address_space_unmap detects that the ref count is 0, and before calling
> mmap(MAP_NORESERVE)?  At that point we are off the data path, so we can
> take locks.

Should be okay.  However, someone still needs to figure out the part
where device_add is delayed until the address_space_unmap.  Otherwise,
the device_add's mmap() overlays the older one, and the delayed
mmap(MAP_NORESERVE) then overlays the device_add's mmap().

It seems like a colossal waste of time to perfect a little-used path that
will no doubt have bugs in it, some of which could perhaps have security
consequences.  Also, QEMU is not the only vhost client, so we're just
scratching the surface here.

Paolo
On Fri, 19 Jun 2015 18:33:39 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> > That's why you need to audit them (also, it's perfectly within the device
> > model's rights to use address_space_unmap this way: it's the guest that's
> > buggy and leaves a dangling reference to a region before unplugging it).
>
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.

So far the HVA approach is unusable even if we make this assumption and
let the guest crash.  virtio-net doesn't work with it anyway: translation
of GPA to HVA for descriptors works as expected (correctly), but virtio
backed by the vhost+HVA hack still can't send/receive packets.

That's why I prefer to merge the kernel solution first, as a stable
solution that doesn't introduce any issues, and work on the userspace
approach on top of that.  Hopefully it can be done, but we would still
need time to iron out the side effects/issues it causes or could cause
before that fix becomes stable enough for production.
On 22/06/2015 09:10, Igor Mammedov wrote:
> So far the HVA approach is unusable even if we make this assumption and
> let the guest crash.  virtio-net doesn't work with it anyway: translation
> of GPA to HVA for descriptors works as expected (correctly), but virtio
> backed by the vhost+HVA hack still can't send/receive packets.
>
> That's why I prefer to merge the kernel solution first, as a stable
> solution that doesn't introduce any issues, and work on the userspace
> approach on top of that.

Also, let's do some math.

Let's assume 3 network devices per VM, one vhost device per queue, and
one queue per VCPU per network device.  Let's assume the host is
overcommitted 3:1.  Thus we have 3*3=9 times as many vhost devices as we
have physical CPUs.  We're thus talking about 108K per physical CPU.

From a relative point of view, and assuming 1 GB of memory per physical
CPU (a pretty low amount if you're overcommitting CPU 3:1), this is 0.01%
of the total memory.

From an absolute point of view, it takes a system with 60 physical CPUs
to reach the same memory usage as the vmlinuz binary of a typical distro
kernel (not counting the modules).

Paolo

> Hopefully it can be done, but we would still need time to iron out the
> side effects/issues it causes or could cause before that fix becomes
> stable enough for production.
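(Making the arithmetic explicit, taking the ~12K-per-vhost-device delta
mentioned earlier in the thread and assuming a distro vmlinuz of roughly
6 MB; both figures are assumptions used only for this back-of-the-envelope
check:)

    3 net devices/VCPU x 3 VCPUs/pCPU   =  9 vhost devices per physical CPU
    9 x 12 KiB                          =  108 KiB per physical CPU
    108 KiB / 1 GiB = 108 / 1048576    ~=  0.01% of memory
    60 pCPUs x 108 KiB                 ~=  6.3 MiB, about one vmlinuz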
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
 #include "vhost.h"
 
 enum {
-	VHOST_MEMORY_MAX_NREGIONS = 64,
+	VHOST_MEMORY_MAX_NREGIONS = 509,
 	VHOST_MEMORY_F_LOG = 0x1,
 };
since commit
1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a bigger amount of memory slots, which is
used by memory hotplug for registering hotplugged memory.  However,
QEMU crashes if it's used with more than ~60 pc-dimm devices and
vhost-net, since the host kernel in the vhost-net module refuses to
accept more than 64 memory regions.

Increase VHOST_MEMORY_MAX_NREGIONS from 64 to 509 to match
KVM_USER_MEM_SLOTS; this fixes the issue for vhost-net.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)