
[3/5] vhost: support up to 509 memory regions

Message ID 1434472419-148742-4-git-send-email-imammedo@redhat.com (mailing list archive)
State New, archived

Commit Message

Igor Mammedov June 16, 2015, 4:33 p.m. UTC
Since commit
 1d4e7e3 kvm: x86: increase user memory slots to 509

it became possible to use a larger number of memory
slots, which memory hotplug uses for registering
hotplugged memory.
However, QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net, since the host kernel's
vhost-net module refuses to accept more than 65
memory regions.

Increasing VHOST_MEMORY_MAX_NREGIONS from 65 to 509
to match KVM_USER_MEM_SLOTS fixes the issue for vhost-net.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michael S. Tsirkin June 16, 2015, 9:14 p.m. UTC | #1
On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> since commit
>  1d4e7e3 kvm: x86: increase user memory slots to 509
> 
> it became possible to use a bigger amount of memory
> slots, which is used by memory hotplug for
> registering hotplugged memory.
> However QEMU crashes if it's used with more than ~60
> pc-dimm devices and vhost-net since host kernel
> in module vhost-net refuses to accept more than 65
> memory regions.
> 
> Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509

It was 64, not 65.

> to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> 
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>

Still thinking about this: can you reorder this to
be the last patch in the series please?

Also - 509?
I think if we are changing this, it'd be nice to
create a way for userspace to discover the support
and the # of regions supported.
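
(For illustration, a minimal sketch of what such a discovery interface could
look like; the ioctl name VHOST_GET_MEM_MAX_NREGIONS and the request number
0x70 are invented for this example and are not in linux/vhost.h or in this
series. VHOST_VIRTIO is the existing vhost ioctl magic.)

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vhost.h>

/* Hypothetical discovery ioctl -- illustration only, not part of this series. */
#define VHOST_GET_MEM_MAX_NREGIONS _IOR(VHOST_VIRTIO, 0x70, __u32)

/* Userspace side: query the limit, falling back to the old hard-coded 64
 * when the kernel does not know the ioctl (i.e. on older kernels). */
static unsigned int vhost_max_mem_regions(int vhost_fd)
{
	__u32 n;

	if (ioctl(vhost_fd, VHOST_GET_MEM_MAX_NREGIONS, &n) < 0)
		return 64;	/* old kernel: request rejected */
	return n;
}

(Something along these lines would also let userspace detect older kernels,
a concern that comes up later in the thread, since the ioctl simply fails there.)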


> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 99931a0..6a18c92 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -30,7 +30,7 @@
>  #include "vhost.h"
>  
>  enum {
> -	VHOST_MEMORY_MAX_NREGIONS = 64,
> +	VHOST_MEMORY_MAX_NREGIONS = 509,
>  	VHOST_MEMORY_F_LOG = 0x1,
>  };
>  
> -- 
> 1.8.3.1
Igor Mammedov June 16, 2015, 10 p.m. UTC | #2
On Tue, 16 Jun 2015 23:14:20 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > since commit
> >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > 
> > it became possible to use a bigger amount of memory
> > slots, which is used by memory hotplug for
> > registering hotplugged memory.
> > However QEMU crashes if it's used with more than ~60
> > pc-dimm devices and vhost-net since host kernel
> > in module vhost-net refuses to accept more than 65
> > memory regions.
> > 
> > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> 
> It was 64, not 65.
> 
> > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > 
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> 
> Still thinking about this: can you reorder this to
> be the last patch in the series please?
sure

> 
> Also - 509?
509 is the number of userspace memory slots in KVM terms; I made it
match KVM's allotment of memory slots for the userspace side.

> I think if we are changing this, it'd be nice to
> create a way for userspace to discover the support
> and the # of regions supported.
That was my first idea before extending KVM's memslots:
teach the kernel to tell QEMU this number so that QEMU
at least would be able to check whether a new memory slot
could be added, but I was redirected to the simpler solution
of just extending the limit rather than overdoing things.
Currently QEMU supports up to ~250 memslots, so 509
is about twice what we need; it should work for the near
future, but eventually we might still teach the kernel and QEMU
to make things more robust.

> 
> 
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 99931a0..6a18c92 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -30,7 +30,7 @@
> >  #include "vhost.h"
> >  
> >  enum {
> > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> >  	VHOST_MEMORY_F_LOG = 0x1,
> >  };
> >  
> > -- 
> > 1.8.3.1

Michael S. Tsirkin June 17, 2015, 6:34 a.m. UTC | #3
On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> On Tue, 16 Jun 2015 23:14:20 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > since commit
> > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > 
> > > it became possible to use a bigger amount of memory
> > > slots, which is used by memory hotplug for
> > > registering hotplugged memory.
> > > However QEMU crashes if it's used with more than ~60
> > > pc-dimm devices and vhost-net since host kernel
> > > in module vhost-net refuses to accept more than 65
> > > memory regions.
> > > 
> > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > 
> > It was 64, not 65.
> > 
> > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > 
> > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > 
> > Still thinking about this: can you reorder this to
> > be the last patch in the series please?
> sure
> 
> > 
> > Also - 509?
> userspace memory slots in terms of KVM, I made it match
> KVM's allotment of memory slots for userspace side.

Maybe KVM has its reasons for this #. I don't see
why we need to match this exactly.

> > I think if we are changing this, it'd be nice to
> > create a way for userspace to discover the support
> > and the # of regions supported.
> That was my first idea before extending KVM's memslots
> to teach kernel to tell qemu this number so that QEMU
> at least would be able to check if new memory slot could
> be added but I was redirected to a more simple solution
> of just extending vs everdoing things.
> Currently QEMU supports upto ~250 memslots so 509
> is about twice high we need it so it should work for near
> future

Yes but old kernels are still around. Would be nice if you
can detect them.

> but eventually we might still teach kernel and QEMU
> to make things more robust.

A new ioctl would be easy to add, I think it's a good
idea generally.

> > 
> > 
> > > ---
> > >  drivers/vhost/vhost.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 99931a0..6a18c92 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -30,7 +30,7 @@
> > >  #include "vhost.h"
> > >  
> > >  enum {
> > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > >  	VHOST_MEMORY_F_LOG = 0x1,
> > >  };
> > >  
> > > -- 
> > > 1.8.3.1
Igor Mammedov June 17, 2015, 7:28 a.m. UTC | #4
On Wed, 17 Jun 2015 08:34:26 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > On Tue, 16 Jun 2015 23:14:20 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > since commit
> > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > 
> > > > it became possible to use a bigger amount of memory
> > > > slots, which is used by memory hotplug for
> > > > registering hotplugged memory.
> > > > However QEMU crashes if it's used with more than ~60
> > > > pc-dimm devices and vhost-net since host kernel
> > > > in module vhost-net refuses to accept more than 65
> > > > memory regions.
> > > > 
> > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > 
> > > It was 64, not 65.
> > > 
> > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > 
> > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > 
> > > Still thinking about this: can you reorder this to
> > > be the last patch in the series please?
> > sure
> > 
> > > 
> > > Also - 509?
> > userspace memory slots in terms of KVM, I made it match
> > KVM's allotment of memory slots for userspace side.
> 
> Maybe KVM has its reasons for this #. I don't see
> why we need to match this exactly.
np, I can cap it at a safe 300 slots, but it's unlikely that the
cut-off would shave off even one extra hop, since it's capped by QEMU
at 256+[initial fragmented memory]

> 
> > > I think if we are changing this, it'd be nice to
> > > create a way for userspace to discover the support
> > > and the # of regions supported.
> > That was my first idea before extending KVM's memslots
> > to teach kernel to tell qemu this number so that QEMU
> > at least would be able to check if new memory slot could
> > be added but I was redirected to a more simple solution
> > of just extending vs everdoing things.
> > Currently QEMU supports upto ~250 memslots so 509
> > is about twice high we need it so it should work for near
> > future
> 
> Yes but old kernels are still around. Would be nice if you
> can detect them.
> 
> > but eventually we might still teach kernel and QEMU
> > to make things more robust.
> 
> A new ioctl would be easy to add, I think it's a good
> idea generally.
I can try to do something like this on top of this series.

> 
> > > 
> > > 
> > > > ---
> > > >  drivers/vhost/vhost.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 99931a0..6a18c92 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -30,7 +30,7 @@
> > > >  #include "vhost.h"
> > > >  
> > > >  enum {
> > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > >  };
> > > >  
> > > > -- 
> > > > 1.8.3.1

Michael S. Tsirkin June 17, 2015, 7:39 a.m. UTC | #5
On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 08:34:26 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > since commit
> > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > 
> > > > > it became possible to use a bigger amount of memory
> > > > > slots, which is used by memory hotplug for
> > > > > registering hotplugged memory.
> > > > > However QEMU crashes if it's used with more than ~60
> > > > > pc-dimm devices and vhost-net since host kernel
> > > > > in module vhost-net refuses to accept more than 65
> > > > > memory regions.
> > > > > 
> > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > 
> > > > It was 64, not 65.
> > > > 
> > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > 
> > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > 
> > > > Still thinking about this: can you reorder this to
> > > > be the last patch in the series please?
> > > sure
> > > 
> > > > 
> > > > Also - 509?
> > > userspace memory slots in terms of KVM, I made it match
> > > KVM's allotment of memory slots for userspace side.
> > 
> > Maybe KVM has its reasons for this #. I don't see
> > why we need to match this exactly.
> np, I can cap it at safe 300 slots but it's unlikely that it
> would take cut off 1 extra hop since it's capped by QEMU
> at 256+[initial fragmented memory]

But what's the point? We allocate 32 bytes per slot.
300*32 = 9600 which is more than 8K, so we are doing
an order-3 allocation anyway.
If we could cap it at 8K (256 slots) that would make sense
since we could avoid wasting vmalloc space.

I'm still not very happy with the whole approach,
giving userspace the ability to allocate 4 whole pages
of kernel memory like this.

> > 
> > > > I think if we are changing this, it'd be nice to
> > > > create a way for userspace to discover the support
> > > > and the # of regions supported.
> > > That was my first idea before extending KVM's memslots
> > > to teach kernel to tell qemu this number so that QEMU
> > > at least would be able to check if new memory slot could
> > > be added but I was redirected to a more simple solution
> > > of just extending vs everdoing things.
> > > Currently QEMU supports upto ~250 memslots so 509
> > > is about twice high we need it so it should work for near
> > > future
> > 
> > Yes but old kernels are still around. Would be nice if you
> > can detect them.
> > 
> > > but eventually we might still teach kernel and QEMU
> > > to make things more robust.
> > 
> > A new ioctl would be easy to add, I think it's a good
> > idea generally.
> I can try to do something like this on top of this series.
> 
> > 
> > > > 
> > > > 
> > > > > ---
> > > > >  drivers/vhost/vhost.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index 99931a0..6a18c92 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -30,7 +30,7 @@
> > > > >  #include "vhost.h"
> > > > >  
> > > > >  enum {
> > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > >  };
> > > > >  
> > > > > -- 
> > > > > 1.8.3.1
Paolo Bonzini June 17, 2015, 8:53 a.m. UTC | #6
On 17/06/2015 08:34, Michael S. Tsirkin wrote:
>>> > > 
>>> > > Also - 509?
>> > userspace memory slots in terms of KVM, I made it match
>> > KVM's allotment of memory slots for userspace side.
> Maybe KVM has its reasons for this #.

A nice power of two (512) minus the number of reserved slots required by
Intel's virtualization extensions (3: APIC access page, EPT identity page
table, VMX task state segment).
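
(Roughly how those numbers fit together in the x86 KVM headers of that era;
the defines below are an approximate paraphrase for illustration, so check
the actual tree for the authoritative values.)

/* arch/x86/include/asm/kvm_host.h, after commit 1d4e7e3 (approximate): */
#define KVM_USER_MEM_SLOTS	509
#define KVM_PRIVATE_MEM_SLOTS	3	/* APIC access page, EPT identity map, VMX TSS */
/* total slot count, kept a power of two: 509 + 3 = 512 */
#define KVM_MEM_SLOTS_NUM	(KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)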

Paolo

> I don't see
> why we need to match this exactly.
> 
Igor Mammedov June 17, 2015, 8:54 a.m. UTC | #7
On Wed, 17 Jun 2015 09:39:06 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 08:34:26 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > since commit
> > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > 
> > > > > > it became possible to use a bigger amount of memory
> > > > > > slots, which is used by memory hotplug for
> > > > > > registering hotplugged memory.
> > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > in module vhost-net refuses to accept more than 65
> > > > > > memory regions.
> > > > > > 
> > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > 
> > > > > It was 64, not 65.
> > > > > 
> > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > 
> > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > 
> > > > > Still thinking about this: can you reorder this to
> > > > > be the last patch in the series please?
> > > > sure
> > > > 
> > > > > 
> > > > > Also - 509?
> > > > userspace memory slots in terms of KVM, I made it match
> > > > KVM's allotment of memory slots for userspace side.
> > > 
> > > Maybe KVM has its reasons for this #. I don't see
> > > why we need to match this exactly.
> > np, I can cap it at safe 300 slots but it's unlikely that it
> > would take cut off 1 extra hop since it's capped by QEMU
> > at 256+[initial fragmented memory]
> 
> But what's the point? We allocate 32 bytes per slot.
> 300*32 = 9600 which is more than 8K, so we are doing
> an order-3 allocation anyway.
> If we could cap it at 8K (256 slots) that would make sense
> since we could avoid wasting vmalloc space.
256 is the number of hotpluggable slots, and there is no way
to predict how fragmented the initial memory would be
(i.e. how many slots it would take); if we guess wrong
we are back to square one with crashing userspace.
So I'd stay consistent with KVM's limit of 509, since
it's only a limit, i.e. not the actual number of allocated slots.

> I'm still not very happy with the whole approach,
> giving userspace ability allocate 4 whole pages
> of kernel memory like this.
I'm working in parallel so that userspace won't take so
many slots, but that won't prevent its current versions
from crashing due to the kernel limitation.

 
> > > > > I think if we are changing this, it'd be nice to
> > > > > create a way for userspace to discover the support
> > > > > and the # of regions supported.
> > > > That was my first idea before extending KVM's memslots
> > > > to teach kernel to tell qemu this number so that QEMU
> > > > at least would be able to check if new memory slot could
> > > > be added but I was redirected to a more simple solution
> > > > of just extending vs everdoing things.
> > > > Currently QEMU supports upto ~250 memslots so 509
> > > > is about twice high we need it so it should work for near
> > > > future
> > > 
> > > Yes but old kernels are still around. Would be nice if you
> > > can detect them.
> > > 
> > > > but eventually we might still teach kernel and QEMU
> > > > to make things more robust.
> > > 
> > > A new ioctl would be easy to add, I think it's a good
> > > idea generally.
> > I can try to do something like this on top of this series.
> > 
> > > 
> > > > > 
> > > > > 
> > > > > > ---
> > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > index 99931a0..6a18c92 100644
> > > > > > --- a/drivers/vhost/vhost.c
> > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > @@ -30,7 +30,7 @@
> > > > > >  #include "vhost.h"
> > > > > >  
> > > > > >  enum {
> > > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > > >  };
> > > > > >  
> > > > > > -- 
> > > > > > 1.8.3.1

Michael S. Tsirkin June 17, 2015, 10:11 a.m. UTC | #8
On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 09:39:06 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > 
> > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > since commit
> > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > 
> > > > > > > it became possible to use a bigger amount of memory
> > > > > > > slots, which is used by memory hotplug for
> > > > > > > registering hotplugged memory.
> > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > memory regions.
> > > > > > > 
> > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > 
> > > > > > It was 64, not 65.
> > > > > > 
> > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > 
> > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > > 
> > > > > > Still thinking about this: can you reorder this to
> > > > > > be the last patch in the series please?
> > > > > sure
> > > > > 
> > > > > > 
> > > > > > Also - 509?
> > > > > userspace memory slots in terms of KVM, I made it match
> > > > > KVM's allotment of memory slots for userspace side.
> > > > 
> > > > Maybe KVM has its reasons for this #. I don't see
> > > > why we need to match this exactly.
> > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > would take cut off 1 extra hop since it's capped by QEMU
> > > at 256+[initial fragmented memory]
> > 
> > But what's the point? We allocate 32 bytes per slot.
> > 300*32 = 9600 which is more than 8K, so we are doing
> > an order-3 allocation anyway.
> > If we could cap it at 8K (256 slots) that would make sense
> > since we could avoid wasting vmalloc space.
> 256 is amount of hotpluggable slots  and there is no way
> to predict how initial memory would be fragmented
> (i.e. amount of slots it would take), if we guess wrong
> we are back to square one with crashing userspace.
> So I'd stay consistent with KVM's limit 509 since
> it's only limit, i.e. not actual amount of allocated slots.
> 
> > I'm still not very happy with the whole approach,
> > giving userspace ability allocate 4 whole pages
> > of kernel memory like this.
> I'm working in parallel so that userspace won't take so
> many slots but it won't prevent its current versions
> crashing due to kernel limitation.

Right, but at least it's not a regression. If we promise userspace
support for a ton of regions, we can't take it back later, and I'm concerned
about the memory usage.

I think it's already safe to merge the binary lookup patches, and maybe
cache and vmalloc, so that the remaining patch will be small.

>  
> > > > > > I think if we are changing this, it'd be nice to
> > > > > > create a way for userspace to discover the support
> > > > > > and the # of regions supported.
> > > > > That was my first idea before extending KVM's memslots
> > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > at least would be able to check if new memory slot could
> > > > > be added but I was redirected to a more simple solution
> > > > > of just extending vs everdoing things.
> > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > is about twice high we need it so it should work for near
> > > > > future
> > > > 
> > > > Yes but old kernels are still around. Would be nice if you
> > > > can detect them.
> > > > 
> > > > > but eventually we might still teach kernel and QEMU
> > > > > to make things more robust.
> > > > 
> > > > A new ioctl would be easy to add, I think it's a good
> > > > idea generally.
> > > I can try to do something like this on top of this series.
> > > 
> > > > 
> > > > > > 
> > > > > > 
> > > > > > > ---
> > > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > index 99931a0..6a18c92 100644
> > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > @@ -30,7 +30,7 @@
> > > > > > >  #include "vhost.h"
> > > > > > >  
> > > > > > >  enum {
> > > > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > > > >  };
> > > > > > >  
> > > > > > > -- 
> > > > > > > 1.8.3.1
Igor Mammedov June 17, 2015, 10:37 a.m. UTC | #9
On Wed, 17 Jun 2015 12:11:09 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 09:39:06 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > 
> > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > since commit
> > > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > 
> > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > registering hotplugged memory.
> > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > memory regions.
> > > > > > > > 
> > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > 
> > > > > > > It was 64, not 65.
> > > > > > > 
> > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > 
> > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > > > 
> > > > > > > Still thinking about this: can you reorder this to
> > > > > > > be the last patch in the series please?
> > > > > > sure
> > > > > > 
> > > > > > > 
> > > > > > > Also - 509?
> > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > KVM's allotment of memory slots for userspace side.
> > > > > 
> > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > why we need to match this exactly.
> > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > at 256+[initial fragmented memory]
> > > 
> > > But what's the point? We allocate 32 bytes per slot.
> > > 300*32 = 9600 which is more than 8K, so we are doing
> > > an order-3 allocation anyway.
> > > If we could cap it at 8K (256 slots) that would make sense
> > > since we could avoid wasting vmalloc space.
> > 256 is amount of hotpluggable slots  and there is no way
> > to predict how initial memory would be fragmented
> > (i.e. amount of slots it would take), if we guess wrong
> > we are back to square one with crashing userspace.
> > So I'd stay consistent with KVM's limit 509 since
> > it's only limit, i.e. not actual amount of allocated slots.
> > 
> > > I'm still not very happy with the whole approach,
> > > giving userspace ability allocate 4 whole pages
> > > of kernel memory like this.
> > I'm working in parallel so that userspace won't take so
> > many slots but it won't prevent its current versions
> > crashing due to kernel limitation.
> 
> Right but at least it's not a regression. If we promise userspace to
> support a ton of regions, we can't take it back later, and I'm concerned
> about the memory usage.
> 
> I think it's already safe to merge the binary lookup patches, and maybe
> cache and vmalloc, so that the remaining patch will be small.
It isn't a regression: with the switch to binary search, increasing the
slots to 509 is, performance-wise, more on the improvement side.
And I was thinking about memory usage as well; that's why I've dropped
the faster radix tree in favor of the more compact array. I can't do
better on the kernel side of the fix.
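
(For context, a minimal sketch of the kind of lookup the binary-search
patches make possible, assuming mem->regions[] is kept sorted by
guest_phys_addr and the regions don't overlap; this is an illustration,
not the actual code from the series.)

#include <linux/vhost.h>

/* Return the region containing guest physical address addr, or NULL. */
static const struct vhost_memory_region *
find_region(const struct vhost_memory *mem, __u64 addr)
{
	int lo = 0, hi = mem->nregions;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		const struct vhost_memory_region *reg = &mem->regions[mid];

		if (addr >= reg->guest_phys_addr + reg->memory_size)
			lo = mid + 1;
		else if (addr < reg->guest_phys_addr)
			hi = mid;
		else
			return reg;
	}
	return NULL;
}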

Yes, we will give userspace the ability to use more slots and lock up
more memory if it's not able to consolidate memory regions, but
that leaves the user the option of running a guest with vhost performance
vs crashing it at runtime.

Userspace/targets that could consolidate memory regions should
do so, and I'm working on that as well, but that doesn't mean
that users shouldn't have a choice.
So far it's a kernel limitation, and this patch fixes the crashes
that users see now, with the rest of the patches keeping performance
from regressing.

> 
> >  
> > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > create a way for userspace to discover the support
> > > > > > > and the # of regions supported.
> > > > > > That was my first idea before extending KVM's memslots
> > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > at least would be able to check if new memory slot could
> > > > > > be added but I was redirected to a more simple solution
> > > > > > of just extending vs everdoing things.
> > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > is about twice high we need it so it should work for near
> > > > > > future
> > > > > 
> > > > > Yes but old kernels are still around. Would be nice if you
> > > > > can detect them.
> > > > > 
> > > > > > but eventually we might still teach kernel and QEMU
> > > > > > to make things more robust.
> > > > > 
> > > > > A new ioctl would be easy to add, I think it's a good
> > > > > idea generally.
> > > > I can try to do something like this on top of this series.
> > > > 
> > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > ---
> > > > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > >  #include "vhost.h"
> > > > > > > >  
> > > > > > > >  enum {
> > > > > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > -- 
> > > > > > > > 1.8.3.1

Michael S. Tsirkin June 17, 2015, 10:46 a.m. UTC | #10
On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 12:11:09 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 09:39:06 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > 
> > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > > since commit
> > > > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > > 
> > > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > > registering hotplugged memory.
> > > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > > memory regions.
> > > > > > > > > 
> > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > > 
> > > > > > > > It was 64, not 65.
> > > > > > > > 
> > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > > > > 
> > > > > > > > Still thinking about this: can you reorder this to
> > > > > > > > be the last patch in the series please?
> > > > > > > sure
> > > > > > > 
> > > > > > > > 
> > > > > > > > Also - 509?
> > > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > > KVM's allotment of memory slots for userspace side.
> > > > > > 
> > > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > > why we need to match this exactly.
> > > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > > at 256+[initial fragmented memory]
> > > > 
> > > > But what's the point? We allocate 32 bytes per slot.
> > > > 300*32 = 9600 which is more than 8K, so we are doing
> > > > an order-3 allocation anyway.
> > > > If we could cap it at 8K (256 slots) that would make sense
> > > > since we could avoid wasting vmalloc space.
> > > 256 is amount of hotpluggable slots  and there is no way
> > > to predict how initial memory would be fragmented
> > > (i.e. amount of slots it would take), if we guess wrong
> > > we are back to square one with crashing userspace.
> > > So I'd stay consistent with KVM's limit 509 since
> > > it's only limit, i.e. not actual amount of allocated slots.
> > > 
> > > > I'm still not very happy with the whole approach,
> > > > giving userspace ability allocate 4 whole pages
> > > > of kernel memory like this.
> > > I'm working in parallel so that userspace won't take so
> > > many slots but it won't prevent its current versions
> > > crashing due to kernel limitation.
> > 
> > Right but at least it's not a regression. If we promise userspace to
> > support a ton of regions, we can't take it back later, and I'm concerned
> > about the memory usage.
> > 
> > I think it's already safe to merge the binary lookup patches, and maybe
> > cache and vmalloc, so that the remaining patch will be small.
> it isn't regression with switching to binary search and increasing
> slots to 509 either performance wise it's more on improvment side.
> And I was thinking about memory usage as well, that's why I've dropped
> faster radix tree in favor of more compact array, can't do better
> on kernel side of fix.
> 
> Yes we will give userspace to ability to use more slots/and lock up
> more memory if it's not able to consolidate memory regions but
> that leaves an option for user to run guest with vhost performance
> vs crashing it at runtime.

Crashing is entirely QEMU's own doing in not handling
the error gracefully.

> 
> userspace/targets that could consolidate memory regions should
> do so and I'm working on that as well but that doesn't mean
> that users shouldn't have a choice.

It's a fairly unusual corner case; I'm not yet
convinced we need to quickly add support for it when just waiting a bit
longer will get us an equivalent (or even more efficient) fix in
userspace.

> So far it's kernel limitation and this patch fixes crashes
> that users see now, with the rest of patches enabling performance
> not to regress.

When I say regression I refer to an option to limit the array
size again after userspace started using the larger size.


> > 
> > >  
> > > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > > create a way for userspace to discover the support
> > > > > > > > and the # of regions supported.
> > > > > > > That was my first idea before extending KVM's memslots
> > > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > > at least would be able to check if new memory slot could
> > > > > > > be added but I was redirected to a more simple solution
> > > > > > > of just extending vs everdoing things.
> > > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > > is about twice high we need it so it should work for near
> > > > > > > future
> > > > > > 
> > > > > > Yes but old kernels are still around. Would be nice if you
> > > > > > can detect them.
> > > > > > 
> > > > > > > but eventually we might still teach kernel and QEMU
> > > > > > > to make things more robust.
> > > > > > 
> > > > > > A new ioctl would be easy to add, I think it's a good
> > > > > > idea generally.
> > > > > I can try to do something like this on top of this series.
> > > > > 
> > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > ---
> > > > > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > > >  #include "vhost.h"
> > > > > > > > >  
> > > > > > > > >  enum {
> > > > > > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > > >  };
> > > > > > > > >  
> > > > > > > > > -- 
> > > > > > > > > 1.8.3.1
Igor Mammedov June 17, 2015, 11:48 a.m. UTC | #11
On Wed, 17 Jun 2015 12:46:09 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 12:11:09 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 09:39:06 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > 
> > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > > > since commit
> > > > > > > > > >  1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > > > 
> > > > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > > > registering hotplugged memory.
> > > > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > > > memory regions.
> > > > > > > > > > 
> > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > > > 
> > > > > > > > > It was 64, not 65.
> > > > > > > > > 
> > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > > > > > 
> > > > > > > > > Still thinking about this: can you reorder this to
> > > > > > > > > be the last patch in the series please?
> > > > > > > > sure
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Also - 509?
> > > > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > > > KVM's allotment of memory slots for userspace side.
> > > > > > > 
> > > > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > > > why we need to match this exactly.
> > > > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > > > at 256+[initial fragmented memory]
> > > > > 
> > > > > But what's the point? We allocate 32 bytes per slot.
> > > > > 300*32 = 9600 which is more than 8K, so we are doing
> > > > > an order-3 allocation anyway.
> > > > > If we could cap it at 8K (256 slots) that would make sense
> > > > > since we could avoid wasting vmalloc space.
> > > > 256 is amount of hotpluggable slots  and there is no way
> > > > to predict how initial memory would be fragmented
> > > > (i.e. amount of slots it would take), if we guess wrong
> > > > we are back to square one with crashing userspace.
> > > > So I'd stay consistent with KVM's limit 509 since
> > > > it's only limit, i.e. not actual amount of allocated slots.
> > > > 
> > > > > I'm still not very happy with the whole approach,
> > > > > giving userspace ability allocate 4 whole pages
> > > > > of kernel memory like this.
> > > > I'm working in parallel so that userspace won't take so
> > > > many slots but it won't prevent its current versions
> > > > crashing due to kernel limitation.
> > > 
> > > Right but at least it's not a regression. If we promise userspace to
> > > support a ton of regions, we can't take it back later, and I'm concerned
> > > about the memory usage.
> > > 
> > > I think it's already safe to merge the binary lookup patches, and maybe
> > > cache and vmalloc, so that the remaining patch will be small.
> > it isn't regression with switching to binary search and increasing
> > slots to 509 either performance wise it's more on improvment side.
> > And I was thinking about memory usage as well, that's why I've dropped
> > faster radix tree in favor of more compact array, can't do better
> > on kernel side of fix.
> > 
> > Yes we will give userspace to ability to use more slots/and lock up
> > more memory if it's not able to consolidate memory regions but
> > that leaves an option for user to run guest with vhost performance
> > vs crashing it at runtime.
> 
> Crashing is entirely QEMU's own doing in not handling
> the error gracefully.
and that's hard to fix (handling the error gracefully) the way it's implemented now.

> > 
> > userspace/targets that could consolidate memory regions should
> > do so and I'm working on that as well but that doesn't mean
> > that users shouldn't have a choice.
> 
> It's a fairly unusual corner case, I'm not yet
> convinced we need to quickly add support to it when just waiting a bit
> longer will get us an equivalent (or even more efficient) fix in
> userspace.
Memory hotplug support in QEMU has been released for quite
a long time already, and there are users who use it, so a fix in
future QEMU won't make it work with their distros.
So I wouldn't say that it is a "fairly unusual corner case".

> 
> > So far it's kernel limitation and this patch fixes crashes
> > that users see now, with the rest of patches enabling performance
> > not to regress.
> 
> When I say regression I refer to an option to limit the array
> size again after userspace started using the larger size.
Is there a need to do so?

Userspace that cares about memory footprint won't use many slots,
keeping it low, and userspace that can't do without many slots
or doesn't care will have a bigger memory footprint.

> 
> 
> > > 
> > > >  
> > > > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > > > create a way for userspace to discover the support
> > > > > > > > > and the # of regions supported.
> > > > > > > > That was my first idea before extending KVM's memslots
> > > > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > > > at least would be able to check if new memory slot could
> > > > > > > > be added but I was redirected to a more simple solution
> > > > > > > > of just extending vs everdoing things.
> > > > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > > > is about twice high we need it so it should work for near
> > > > > > > > future
> > > > > > > 
> > > > > > > Yes but old kernels are still around. Would be nice if you
> > > > > > > can detect them.
> > > > > > > 
> > > > > > > > but eventually we might still teach kernel and QEMU
> > > > > > > > to make things more robust.
> > > > > > > 
> > > > > > > A new ioctl would be easy to add, I think it's a good
> > > > > > > idea generally.
> > > > > > I can try to do something like this on top of this series.
> > > > > > 
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > ---
> > > > > > > > > >  drivers/vhost/vhost.c | 2 +-
> > > > > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > > > >  #include "vhost.h"
> > > > > > > > > >  
> > > > > > > > > >  enum {
> > > > > > > > > > -	VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > > > +	VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > > > >  	VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > > > >  };
> > > > > > > > > >  
> > > > > > > > > > -- 
> > > > > > > > > > 1.8.3.1

Michael S. Tsirkin June 17, 2015, 11:51 a.m. UTC | #12
On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > So far it's kernel limitation and this patch fixes crashes
> > > that users see now, with the rest of patches enabling performance
> > > not to regress.
> > 
> > When I say regression I refer to an option to limit the array
> > size again after userspace started using the larger size.
> Is there a need to do so?

Considering userspace can be malicious, I guess yes.

> Userspace that cares about memory footprint won't use many slots
> keeping it low and user space that can't do without many slots
> or doesn't care will have bigger memory footprint.

We really can't trust userspace to do the right thing though.
Igor Mammedov June 17, 2015, 12:23 p.m. UTC | #13
On Wed, 17 Jun 2015 13:51:56 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > > So far it's kernel limitation and this patch fixes crashes
> > > > that users see now, with the rest of patches enabling performance
> > > > not to regress.
> > > 
> > > When I say regression I refer to an option to limit the array
> > > size again after userspace started using the larger size.
> > Is there a need to do so?
> 
> Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case;
setting the limit back from 509 to 64 will not help here in any way,
since userspace can still create as many vhost instances as it needs
to consume the memory it desires.

> 
> > Userspace that cares about memory footprint won't use many slots
> > keeping it low and user space that can't do without many slots
> > or doesn't care will have bigger memory footprint.
> 
> We really can't trust userspace to do the right thing though.
> 

Michael S. Tsirkin June 17, 2015, 1:13 p.m. UTC | #14
On Wed, Jun 17, 2015 at 02:23:39PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 13:51:56 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > > > So far it's kernel limitation and this patch fixes crashes
> > > > > that users see now, with the rest of patches enabling performance
> > > > > not to regress.
> > > > 
> > > > When I say regression I refer to an option to limit the array
> > > > size again after userspace started using the larger size.
> > > Is there a need to do so?
> > 
> > Considering userspace can be malicious, I guess yes.
> I don't think it's a valid concern in this case,
> setting limit back from 509 to 64 will not help here in any way,
> userspace still can create as many vhost instances as it needs
> to consume memory it desires.

Not really, since the vhost char device isn't world-accessible.
It's typically opened by a privileged tool, and the fd is
then passed to unprivileged userspace, or permissions are dropped.

> > 
> > > Userspace that cares about memory footprint won't use many slots
> > > keeping it low and user space that can't do without many slots
> > > or doesn't care will have bigger memory footprint.
> > 
> > We really can't trust userspace to do the right thing though.
> > 
Paolo Bonzini June 17, 2015, 1:20 p.m. UTC | #15
On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > Considering userspace can be malicious, I guess yes.
> > I don't think it's a valid concern in this case,
> > setting limit back from 509 to 64 will not help here in any way,
> > userspace still can create as many vhost instances as it needs
> > to consume memory it desires.
> 
> Not really since vhost char device isn't world-accessible.
> It's typically opened by a priveledged tool, the fd is
> then passed to an unpriveledged userspace, or permissions dropped.

Then what's the concern anyway?

Paolo
Michael S. Tsirkin June 17, 2015, 2:32 p.m. UTC | #16
On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > Considering userspace can be malicious, I guess yes.
> > > I don't think it's a valid concern in this case,
> > > setting limit back from 509 to 64 will not help here in any way,
> > > userspace still can create as many vhost instances as it needs
> > > to consume memory it desires.
> > 
> > Not really since vhost char device isn't world-accessible.
> > It's typically opened by a priveledged tool, the fd is
> > then passed to an unpriveledged userspace, or permissions dropped.
> 
> Then what's the concern anyway?
> 
> Paolo

Each fd now ties up 16K of kernel memory.  It didn't use to, so a
privileged tool could safely give unprivileged userspace
a ton of these fds.
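
(The 16K figure follows from the size of the uAPI region descriptor,
4 x __u64 = 32 bytes each; a tiny userspace check of the arithmetic,
for illustration only:)

#include <stdio.h>
#include <linux/vhost.h>

int main(void)
{
	/* 509 * sizeof(struct vhost_memory_region) = 509 * 32 = 16288 bytes,
	 * i.e. roughly 16K of table memory per vhost fd at the new limit. */
	printf("%zu bytes\n", 509 * sizeof(struct vhost_memory_region));
	return 0;
}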
Igor Mammedov June 17, 2015, 3:12 p.m. UTC | #17
On Wed, 17 Jun 2015 16:32:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > Considering userspace can be malicious, I guess yes.
> > > > I don't think it's a valid concern in this case,
> > > > setting limit back from 509 to 64 will not help here in any way,
> > > > userspace still can create as many vhost instances as it needs
> > > > to consume memory it desires.
> > > 
> > > Not really since vhost char device isn't world-accessible.
> > > It's typically opened by a priveledged tool, the fd is
> > > then passed to an unpriveledged userspace, or permissions dropped.
> > 
> > Then what's the concern anyway?
> > 
> > Paolo
> 
> Each fd now ties up 16K of kernel memory.  It didn't use to, so
> priveledged tool could safely give the unpriveledged userspace
> a ton of these fds.
If a privileged tool gives out an unlimited number of fds, then it
doesn't matter whether an fd ties up 4K or 16K; the host could still be DoSed.


Michael S. Tsirkin June 17, 2015, 3:38 p.m. UTC | #18
On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 16:32:02 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > Considering userspace can be malicious, I guess yes.
> > > > > I don't think it's a valid concern in this case,
> > > > > setting limit back from 509 to 64 will not help here in any way,
> > > > > userspace still can create as many vhost instances as it needs
> > > > > to consume memory it desires.
> > > > 
> > > > Not really since vhost char device isn't world-accessible.
> > > > It's typically opened by a priveledged tool, the fd is
> > > > then passed to an unpriveledged userspace, or permissions dropped.
> > > 
> > > Then what's the concern anyway?
> > > 
> > > Paolo
> > 
> > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > priveledged tool could safely give the unpriveledged userspace
> > a ton of these fds.
> if privileged tool gives out unlimited amount of fds then it
> doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> 

Of course it does not give out unlimited fds; there's a way
for the sysadmin to specify the number of fds. Look at how libvirt
uses vhost; it should become clear, I think.
Igor Mammedov June 17, 2015, 4:09 p.m. UTC | #19
On Wed, 17 Jun 2015 17:38:40 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 16:32:02 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > I don't think it's a valid concern in this case,
> > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > way, userspace still can create as many vhost instances as
> > > > > > it needs to consume memory it desires.
> > > > > 
> > > > > Not really since vhost char device isn't world-accessible.
> > > > > It's typically opened by a priveledged tool, the fd is
> > > > > then passed to an unpriveledged userspace, or permissions
> > > > > dropped.
> > > > 
> > > > Then what's the concern anyway?
> > > > 
> > > > Paolo
> > > 
> > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > priveledged tool could safely give the unpriveledged userspace
> > > a ton of these fds.
> > if privileged tool gives out unlimited amount of fds then it
> > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > 
> 
> Of course it does not give out unlimited fds, there's a way
> for the sysadmin to specify the number of fds. Look at how libvirt
> uses vhost, it should become clear I think.
Then it just means that the tool has to take the new limit
into account to partition the host in a sensible manner.
Exposing the limit as a module parameter might help the tool
get/set it in whatever way it needs.
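
(A rough sketch of that module-parameter idea, assuming the compile-time
VHOST_MEMORY_MAX_NREGIONS constant were replaced by a tunable; the name,
type and default below are illustrative, not part of this series.)

#include <linux/module.h>

/* Illustrative only: per-device memory-region cap, settable at module load. */
static ushort max_mem_regions = 64;
module_param(max_mem_regions, ushort, 0444);
MODULE_PARM_DESC(max_mem_regions,
	"Maximum number of memory regions in memory map. (default: 64)");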
Michael S. Tsirkin June 17, 2015, 4:30 p.m. UTC | #20
On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 17:38:40 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > 
> > > > > 
> > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > I don't think it's a valid concern in this case,
> > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > it needs to consume memory it desires.
> > > > > > 
> > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > dropped.
> > > > > 
> > > > > Then what's the concern anyway?
> > > > > 
> > > > > Paolo
> > > > 
> > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > priveledged tool could safely give the unpriveledged userspace
> > > > a ton of these fds.
> > > if privileged tool gives out unlimited amount of fds then it
> > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > 
> > 
> > Of course it does not give out unlimited fds, there's a way
> > for the sysadmin to specify the number of fds. Look at how libvirt
> > uses vhost, it should become clear I think.
> then it just means that tool has to take into account a new limits
> to partition host in sensible manner.

Meanwhile old tools are vulnerable to OOM attacks.

> Exposing limit as module parameter might be of help to tool for
> getting/setting it in a way it needs.
Paolo Bonzini June 17, 2015, 4:31 p.m. UTC | #21
On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> Meanwhile old tools are vulnerable to OOM attacks.

For each vhost device there will be likely one tap interface, and I
suspect that it takes way, way more than 16KB of memory.

Paolo
Michael S. Tsirkin June 17, 2015, 4:34 p.m. UTC | #22
On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> > Meanwhile old tools are vulnerable to OOM attacks.
> 
> For each vhost device there will be likely one tap interface, and I
> suspect that it takes way, way more than 16KB of memory.
> 
> Paolo

That's not true. We have a vhost device per queue, all queues
are part of a single tap device.
Paolo Bonzini June 17, 2015, 4:38 p.m. UTC | #23
On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
>>> Meanwhile old tools are vulnerable to OOM attacks.
>>
>> For each vhost device there will be likely one tap interface, and I
>> suspect that it takes way, way more than 16KB of memory.
> 
> That's not true. We have a vhost device per queue, all queues
> are part of a single tap device.

s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.

Paolo
Michael S. Tsirkin June 17, 2015, 4:41 p.m. UTC | #24
On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>> Meanwhile old tools are vulnerable to OOM attacks.
> >>
> >> For each vhost device there will be likely one tap interface, and I
> >> suspect that it takes way, way more than 16KB of memory.
> > 
> > That's not true. We have a vhost device per queue, all queues
> > are part of a single tap device.
> 
> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
> 
> Paolo

That's up to you as a kvm maintainer :)
People are already concerned about vhost device
memory usage, I'm not happy to define our user/kernel interface
in a way that forces even more memory to be used up.
Paolo Bonzini June 17, 2015, 4:47 p.m. UTC | #25
On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
>>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
>>>>> Meanwhile old tools are vulnerable to OOM attacks.
>>>>
>>>> For each vhost device there will be likely one tap interface, and I
>>>> suspect that it takes way, way more than 16KB of memory.
>>>
>>> That's not true. We have a vhost device per queue, all queues
>>> are part of a single tap device.
>>
>> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
> 
> That's up to you as a kvm maintainer :)

Not easy, when the CPU alone requires three (albeit non-consecutive)
pages for the VMCS, the APIC access page and the EPT root.

> People are already concerned about vhost device
> memory usage, I'm not happy to define our user/kernel interface
> in a way that forces even more memory to be used up.

So, the questions to ask are:

1) What is the memory usage like immediately after vhost is brought up,
apart from these 16K?

2) Is there anything in vhost that allocates a user-controllable amount
of memory?

3) What is the size of the data structures that support one virtqueue
(there are two of them)?  Does it depend on the size of the virtqueues?

4) Would it make sense to share memory regions between multiple vhost
devices?  Would it be hard to implement?  It would also make memory
operations O(1) rather than O(#cpus).

Paolo
Igor Mammedov June 17, 2015, 5:30 p.m. UTC | #26
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in
> > > > > > > > any way, userspace still can create as many vhost
> > > > > > > > instances as it needs to consume memory it desires.
> > > > > > > 
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > > 
> > > > > > Then what's the concern anyway?
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Each fd now ties up 16K of kernel memory.  It didn't use to,
> > > > > so priveledged tool could safely give the unpriveledged
> > > > > userspace a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be
> > > > DoSed.
> > > > 
> > > 
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
> 
> Meanwhile old tools are vulnerable to OOM attacks.
Let's leave the old limit as the default and allow overriding it via a
module parameter; that way old tools won't be affected and new tools
can set the limit the way they need.
That will accommodate the current slot-hungry userspace as well as a new
one with a continuous HVA, and won't regress old tools.
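
A minimal sketch of what such a module parameter could look like
(illustrative only, not the posted patch; the parameter name and
permissions are assumptions):

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Default stays at 64 so existing setups see no change; an admin who
     * needs more regions can raise it at module load time, e.g.
     * "modprobe vhost max_mem_regions=509", and tools can read it back
     * from /sys/module/vhost/parameters/max_mem_regions. */
    static ushort max_mem_regions = 64;
    module_param(max_mem_regions, ushort, 0444);
    MODULE_PARM_DESC(max_mem_regions,
            "Maximum number of memory regions per vhost device (default: 64)");

    /* The VHOST_SET_MEM_TABLE handler would then check the requested
     * region count against max_mem_regions instead of the
     * VHOST_MEMORY_MAX_NREGIONS constant. */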

> 
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
> 

Igor Mammedov June 17, 2015, 5:32 p.m. UTC | #27
On Wed, 17 Jun 2015 18:47:18 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>>>> Meanwhile old tools are vulnerable to OOM attacks.
> >>>>
> >>>> For each vhost device there will be likely one tap interface,
> >>>> and I suspect that it takes way, way more than 16KB of memory.
> >>>
> >>> That's not true. We have a vhost device per queue, all queues
> >>> are part of a single tap device.
> >>
> >> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
> > 
> > That's up to you as a kvm maintainer :)
> 
> Not easy, when the CPU alone requires three (albeit non-consecutive)
> pages for the VMCS, the APIC access page and the EPT root.
> 
> > People are already concerned about vhost device
> > memory usage, I'm not happy to define our user/kernel interface
> > in a way that forces even more memory to be used up.
> 
> So, the questions to ask are:
> 
> 1) What is the memory usage like immediately after vhost is brought
> up, apart from these 16K?
> 
> 2) Is there anything in vhost that allocates a user-controllable
> amount of memory?
> 
> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?  Does it depend on the size of the
> virtqueues?
> 
> 4) Would it make sense to share memory regions between multiple vhost
> devices?  Would it be hard to implement?  It would also make memory
> operations O(1) rather than O(#cpus).
> 
> Paolo

In addition to that, could vhost share the memory map with KVM, i.e. use its
memslots instead of duplicating them?
Michael S. Tsirkin June 17, 2015, 7:11 p.m. UTC | #28
On Wed, Jun 17, 2015 at 06:47:18PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>>>> Meanwhile old tools are vulnerable to OOM attacks.
> >>>>
> >>>> For each vhost device there will be likely one tap interface, and I
> >>>> suspect that it takes way, way more than 16KB of memory.
> >>>
> >>> That's not true. We have a vhost device per queue, all queues
> >>> are part of a single tap device.
> >>
> >> s/tap/VCPU/ then.  A KVM VCPU also takes more than 16KB of memory.
> > 
> > That's up to you as a kvm maintainer :)
> 
> Not easy, when the CPU alone requires three (albeit non-consecutive)
> pages for the VMCS, the APIC access page and the EPT root.
> 
> > People are already concerned about vhost device
> > memory usage, I'm not happy to define our user/kernel interface
> > in a way that forces even more memory to be used up.
> 
> So, the questions to ask are:
> 
> 1) What is the memory usage like immediately after vhost is brought up,
> apart from these 16K?

About 24K, but most of that is iov pool arrays, kept around as an optimization
to avoid kmalloc on the data path. Below 1K tracks persistent state.
Recently people have been complaining about these pools,
so I've been thinking about switching to a per-cpu array
or something similar.

> 2) Is there anything in vhost that allocates a user-controllable amount
> of memory?

Definitely not in vhost-net.

> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?

Around 256 bytes.

>  Does it depend on the size of the virtqueues?

No.

> 4) Would it make sense to share memory regions between multiple vhost
> devices?  Would it be hard to implement?

It's not trivial. It would absolutely require userspace ABI
extensions.

>  It would also make memory
> operations O(1) rather than O(#cpus).
> 
> Paolo

We'd save the kmalloc/memcpy/kfree, that is true.

But we'd still need to flush all VQs so it's still O(#cpus),
we'd just be doing less stuff in that O(#cpus).
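
As a back-of-the-envelope check on the 16K-per-fd figure discussed
above: assuming each region descriptor is four u64 fields (a guess at
the uapi layout, shown only for the arithmetic), 509 of them come to
roughly 16K:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed layout of one region descriptor: 4 x u64 = 32 bytes. */
    struct region_sketch {
            uint64_t guest_phys_addr;
            uint64_t memory_size;
            uint64_t userspace_addr;
            uint64_t flags_padding;
    };

    int main(void)
    {
            /* 509 regions * 32 bytes = 16288 bytes, i.e. about 16K
             * allocated per device fd for a maximally sized table. */
            printf("%zu bytes\n", 509 * sizeof(struct region_sketch));
            return 0;
    }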
Igor Mammedov June 18, 2015, 9:12 a.m. UTC | #29
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > it needs to consume memory it desires.
> > > > > > > 
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > > 
> > > > > > Then what's the concern anyway?
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > 
> > > 
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
> 
> Meanwhile old tools are vulnerable to OOM attacks.
I've chatted with the libvirt folks: it doesn't care about how much memory
vhost would consume, nor does it do any host capacity planning in that regard.

But let's assume that there are tools that do this. So how about, instead
of hardcoding the limit, making it a module parameter with the default set
to 64? That would allow users to set a higher limit if they need it and
not regress old tools. It will also give tools an interface for reading
the limit from the vhost module.

> 
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
> 

Michael S. Tsirkin June 18, 2015, 9:50 a.m. UTC | #30
On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 18:30:02 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 17:38:40 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > > it needs to consume memory it desires.
> > > > > > > > 
> > > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > > dropped.
> > > > > > > 
> > > > > > > Then what's the concern anyway?
> > > > > > > 
> > > > > > > Paolo
> > > > > > 
> > > > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > > a ton of these fds.
> > > > > if privileged tool gives out unlimited amount of fds then it
> > > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > > 
> > > > 
> > > > Of course it does not give out unlimited fds, there's a way
> > > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > > uses vhost, it should become clear I think.
> > > then it just means that tool has to take into account a new limits
> > > to partition host in sensible manner.
> > 
> > Meanwhile old tools are vulnerable to OOM attacks.
> I've chatted with libvirt folks, it doesn't care about how much memory
> vhost would consume nor do any host capacity planning in that regard.

Exactly, it's up to host admin.

> But lets assume that there are tools that do this so
> how about instead of hardcoding limit make it a module parameter
> with default set to 64. That would allow users to set higher limit
> if they need it and nor regress old tools. it will also give tools
> interface for reading limit from vhost module.

And now you need to choose between security and functionality :(

> > 
> > > Exposing limit as module parameter might be of help to tool for
> > > getting/setting it in a way it needs.
> > 
Paolo Bonzini June 18, 2015, 10:03 a.m. UTC | #31
On 18/06/2015 11:50, Michael S. Tsirkin wrote:
> > But lets assume that there are tools that do this so
> > how about instead of hardcoding limit make it a module parameter
> > with default set to 64. That would allow users to set higher limit
> > if they need it and nor regress old tools. it will also give tools
> > interface for reading limit from vhost module.
> 
> And now you need to choose between security and functionality :(

Don't call "security" a 16K allocation that can fall back to vmalloc
please.  That's an insult to actual security problems...

Paolo
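
The "fall back to vmalloc" point could look roughly like the helper
below (a sketch with an invented name; the kernel at the time had no
kvmalloc(), so the fallback is open-coded):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* Try a physically contiguous allocation first; if that fails, e.g.
     * for a ~16K region table under memory pressure, fall back to
     * vmalloc so the ioctl still succeeds. */
    static void *vhost_kvzalloc(unsigned long size)
    {
            void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

            if (!n)
                    n = vzalloc(size);
            return n;
    }

    /* The matching free path must use kvfree(), which handles both
     * kinds of allocation. */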
Igor Mammedov June 18, 2015, 11:39 a.m. UTC | #32
On Thu, 18 Jun 2015 11:50:22 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 18:30:02 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 17:38:40 +0200
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > 
> > > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > > > it needs to consume memory it desires.
> > > > > > > > > 
> > > > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > > > dropped.
> > > > > > > > 
> > > > > > > > Then what's the concern anyway?
> > > > > > > > 
> > > > > > > > Paolo
> > > > > > > 
> > > > > > > Each fd now ties up 16K of kernel memory.  It didn't use to, so
> > > > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > > > a ton of these fds.
> > > > > > if privileged tool gives out unlimited amount of fds then it
> > > > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > > > 
> > > > > 
> > > > > Of course it does not give out unlimited fds, there's a way
> > > > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > > > uses vhost, it should become clear I think.
> > > > then it just means that tool has to take into account a new limits
> > > > to partition host in sensible manner.
> > > 
> > > Meanwhile old tools are vulnerable to OOM attacks.
> > I've chatted with libvirt folks, it doesn't care about how much memory
> > vhost would consume nor do any host capacity planning in that regard.
> 
> Exactly, it's up to host admin.
> 
> > But lets assume that there are tools that do this so
> > how about instead of hardcoding limit make it a module parameter
> > with default set to 64. That would allow users to set higher limit
> > if they need it and nor regress old tools. it will also give tools
> > interface for reading limit from vhost module.
> 
> And now you need to choose between security and functionality :(
There is no conflict here, and it's not about choosing.
If the admin has a method to estimate guest memory footprint
to do capacity partitioning, then he would need to redo the
partitioning, taking the new footprint into account, when
he/she raises the limit manually.

(BTW, libvirt has tried and reverted patches that were trying to
predict the required memory; the admin might be able to do it manually
better, but how to do that is another topic and it's not related
to this thread.)

Let's leave the decision up to users instead of making them live with
crashing guests.

> 
> > > 
> > > > Exposing limit as module parameter might be of help to tool for
> > > > getting/setting it in a way it needs.
> > > 

Michael S. Tsirkin June 18, 2015, 11:41 a.m. UTC | #33
On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> Lets leave decision upto users instead of making them live with
> crashing guests.

Come on, let's fix it in userspace.
Paolo Bonzini June 18, 2015, 11:50 a.m. UTC | #34
On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
>> Lets leave decision upto users instead of making them live with
>> crashing guests.
> 
> Come on, let's fix it in userspace.

It's not trivial to fix it in userspace.  Since QEMU uses RCU there
isn't a single memory map to use for a linear gpa->hva map.

I find it absurd that we're fighting over 12K of memory.

Paolo
Igor Mammedov June 18, 2015, 12:02 p.m. UTC | #35
On Thu, 18 Jun 2015 13:41:22 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > Lets leave decision upto users instead of making them live with
> > crashing guests.
> 
> Come on, let's fix it in userspace.
I'm not abandoning the userspace approach either, but it might take time
to implement in a robust manner, as it's much more complex and has many
more places to backfire than a straightforward kernel fix, which will
work for both old userspace and new.

Michael S. Tsirkin June 18, 2015, 1:19 p.m. UTC | #36
On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> 
> 
> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> >> Lets leave decision upto users instead of making them live with
> >> crashing guests.
> > 
> > Come on, let's fix it in userspace.
> 
> It's not trivial to fix it in userspace.  Since QEMU uses RCU there
> isn't a single memory map to use for a linear gpa->hva map.

Could you elaborate?

I'm confused by this mention of RCU.
You use RCU for accesses to the memory map, correct?
So the memory map update itself is a write-side operation; as such, all you
need to do is take some kind of lock to prevent conflicting with other
memory map updates, and do an rcu sync under this lock.


> I find it absurd that we're fighting over 12K of memory.
> 
> Paolo

I wouldn't worry so much if it didn't affect kernel/userspace API.
Need to be careful there.
Paolo Bonzini June 18, 2015, 1:46 p.m. UTC | #37
On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
>>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
>>>> Lets leave decision upto users instead of making them live with
>>>> crashing guests.
>>>
>>> Come on, let's fix it in userspace.
>>
>> It's not trivial to fix it in userspace.  Since QEMU uses RCU there
>> isn't a single memory map to use for a linear gpa->hva map.
> 
> Could you elaborate?
> 
> I'm confused by this mention of RCU.
> You use RCU for accesses to the memory map, correct?
> So memory map itself is a write side operation, as such all you need to
> do is take some kind of lock to prevent conflicting with other memory
> maps, do rcu sync under this lock.

You're right, the problem isn't directly related to RCU.  RCU would be
easy to handle by using synchronize_rcu instead of call_rcu.  While I
identified an RCU-related problem with Igor's patches, it's much more
entrenched.

RAM can be used by asynchronous operations while the VM runs, between
address_space_map and address_space_unmap.  It is possible and common to
have a quiescent state between the map and unmap, and a memory map
change can happen in the middle of this.  Normally this is not a
problem, because changes to the memory map do not make the hva go away
(memory regions are reference counted).

However, with Igor's patches a memory_region_del_subregion will cause a
mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.

I guess one way to do it would be to alias the same page in two places,
one for use by vhost and one for use by everything else.  However, the
kernel does not provide the means to do this kind of aliasing for
anonymous mmaps.

Paolo
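
For reference, a rough sketch of the map/unmap lifetime being described
(do_async_io and req are invented for illustration, and the calls are
simplified pseudocode rather than the exact QEMU API):

    hwaddr len = req->len;
    void *hva = address_space_map(&address_space_memory,
                                  req->gpa, &len, true /* is_write */);

    /* The device model now uses 'hva' asynchronously, possibly from
     * another thread with the big QEMU lock dropped.  The memory region
     * stays referenced, so normally the hva remains valid ...          */
    do_async_io(req->fd, hva, len);

    /* ... but an mmap(MAP_NORESERVE) over that range would invalidate
     * the hva even though the reference is still held.                 */
    address_space_unmap(&address_space_memory, hva, len,
                        true /* is_write */, len /* access_len */);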
Michael S. Tsirkin June 18, 2015, 2:47 p.m. UTC | #38
On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
> 
> 
> On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> >>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> >>>> Lets leave decision upto users instead of making them live with
> >>>> crashing guests.
> >>>
> >>> Come on, let's fix it in userspace.
> >>
> >> It's not trivial to fix it in userspace.  Since QEMU uses RCU there
> >> isn't a single memory map to use for a linear gpa->hva map.
> > 
> > Could you elaborate?
> > 
> > I'm confused by this mention of RCU.
> > You use RCU for accesses to the memory map, correct?
> > So memory map itself is a write side operation, as such all you need to
> > do is take some kind of lock to prevent conflicting with other memory
> > maps, do rcu sync under this lock.
> 
> You're right, the problem isn't directly related to RCU.  RCU would be
> easy to handle by using synchronize_rcu instead of call_rcu.  While I
> identified an RCU-related problem with Igor's patches, it's much more
> entrenched.
> 
> RAM can be used by asynchronous operations while the VM runs, between
> address_space_map and address_space_unmap.  It is possible and common to
> have a quiescent state between the map and unmap, and a memory map
> change can happen in the middle of this.  Normally this is not a
> problem, because changes to the memory map do not make the hva go away
> (memory regions are reference counted).

Right, so you want mmap(MAP_NORESERVE) when that reference
count becomes 0.

> However, with Igor's patches a memory_region_del_subregion will cause a
> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> 
> I guess one way to do it would be to alias the same page in two places,
> one for use by vhost and one for use by everything else.  However, the
> kernel does not provide the means to do this kind of aliasing for
> anonymous mmaps.
> 
> Paolo

Basically pages go away on munmap, so won't simple
	lock
	munmap
	mmap(MAP_NORESERVE)
	unlock
do the trick?
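
Read literally, that sequence could be sketched as below (the helper
name and error handling are invented; whether this is actually safe is
what the rest of the thread debates):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Caller holds whatever lock serializes memory map updates.
     * Drop the backing pages of a removed DIMM while keeping the
     * address range owned by the process; MAP_FIXED re-creates the
     * mapping at the same hva (an mmap(MAP_FIXED) alone would also
     * replace the old mapping atomically, without the munmap). */
    static int discard_range(void *hva, size_t len)
    {
            if (munmap(hva, len) < 0)
                    return -1;
            if (mmap(hva, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                     -1, 0) == MAP_FAILED)
                    return -1;
            return 0;
    }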
Igor Mammedov June 18, 2015, 3:54 p.m. UTC | #39
On Thu, 18 Jun 2015 16:47:33 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> > > On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> > >>
> > >>
> > >> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > >>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > >>>> Lets leave decision upto users instead of making them live with
> > >>>> crashing guests.
> > >>>
> > >>> Come on, let's fix it in userspace.
> > >>
> > >> It's not trivial to fix it in userspace.  Since QEMU uses RCU there
> > >> isn't a single memory map to use for a linear gpa->hva map.
> > > 
> > > Could you elaborate?
> > > 
> > > I'm confused by this mention of RCU.
> > > You use RCU for accesses to the memory map, correct?
> > > So memory map itself is a write side operation, as such all you need to
> > > do is take some kind of lock to prevent conflicting with other memory
> > > maps, do rcu sync under this lock.
> > 
> > You're right, the problem isn't directly related to RCU.  RCU would be
> > easy to handle by using synchronize_rcu instead of call_rcu.  While I
> > identified an RCU-related problem with Igor's patches, it's much more
> > entrenched.
> > 
> > RAM can be used by asynchronous operations while the VM runs, between
> > address_space_map and address_space_unmap.  It is possible and common to
> > have a quiescent state between the map and unmap, and a memory map
> > change can happen in the middle of this.  Normally this is not a
> > problem, because changes to the memory map do not make the hva go away
> > (memory regions are reference counted).
> 
> Right, so you want mmap(MAP_NORESERVE) when that reference
> count becomes 0.
> 
> > However, with Igor's patches a memory_region_del_subregion will cause a
> > mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> > 
> > I guess one way to do it would be to alias the same page in two places,
> > one for use by vhost and one for use by everything else.  However, the
> > kernel does not provide the means to do this kind of aliasing for
> > anonymous mmaps.
> > 
> > Paolo
> 
> Basically pages go away on munmap, so won't simple
> 	lock
> 	munmap
> 	mmap(MAP_NORESERVE)
> 	unlock
> do the trick?
At what point are you suggesting doing this?



Paolo Bonzini June 18, 2015, 4:02 p.m. UTC | #40
On 18/06/2015 16:47, Michael S. Tsirkin wrote:
>> However, with Igor's patches a memory_region_del_subregion will cause a
>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
>>
>> I guess one way to do it would be to alias the same page in two places,
>> one for use by vhost and one for use by everything else.  However, the
>> kernel does not provide the means to do this kind of aliasing for
>> anonymous mmaps.
> 
> Basically pages go away on munmap, so won't simple
> 	lock
> 	munmap
> 	mmap(MAP_NORESERVE)
> 	unlock
> do the trick?

Not sure I follow.  Here we have this:

    VCPU 1                             VCPU 2                          I/O worker
    ----------------------------------------------------------------------------------------
    take big QEMU lock
    p = address_space_map(hva, len)
    pass I/O request to worker thread
                                                                       read(fd, p, len)
    release big QEMU lock

                                        memory_region_del_subregion
                                          mmap(MAP_NORESERVE)

                                                                       read returns EFAULT
                                                                       wake up VCPU 1
    take big QEMU lock
    EFAULT?  What's that?

In another scenario you are less lucky: the memory accesses
between address_space_map/unmap aren't done in the kernel and
you get a plain old SIGSEGV.

This is not something that you can fix with a lock.  The very
purpose of the map/unmap API is to do stuff asynchronously while
the lock is released.

Thanks,

Paolo
Michael S. Tsirkin June 19, 2015, 7:56 a.m. UTC | #41
On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
> 
> 
> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
> >> However, with Igor's patches a memory_region_del_subregion will cause a
> >> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> >>
> >> I guess one way to do it would be to alias the same page in two places,
> >> one for use by vhost and one for use by everything else.  However, the
> >> kernel does not provide the means to do this kind of aliasing for
> >> anonymous mmaps.
> > 
> > Basically pages go away on munmap, so won't simple
> > 	lock
> > 	munmap
> > 	mmap(MAP_NORESERVE)
> > 	unlock
> > do the trick?
> 
> Not sure I follow.  Here we have this:
> 
>     VCPU 1                             VCPU 2                          I/O worker
>     ----------------------------------------------------------------------------------------
>     take big QEMU lock
>     p = address_space_map(hva, len)
>     pass I/O request to worker thread
>                                                                        read(fd, p, len)
>     release big QEMU lock
> 
>                                         memory_region_del_subregion
>                                           mmap(MAP_NORESERVE)
> 
>                                                                        read returns EFAULT

Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
Doesn't memory_region_del_subregion free the memory?

>                                                                        wake up VCPU 1
>     take big QEMU lock
>     EFAULT?  What's that?
> 
> In another scenario you are less lucky: the memory accesses
> between address_space_map/unmap aren't done in the kernel and
> you get a plain old SIGSEGV.
> 
> This is not something that you can fix with a lock.  The very
> purpose of the map/unmap API is to do stuff asynchronously while
> the lock is released.
> 
> Thanks,
> 
> Paolo
Paolo Bonzini June 19, 2015, 7:57 a.m. UTC | #42
On 19/06/2015 09:56, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
>>>> However, with Igor's patches a memory_region_del_subregion will cause a
>>>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
>>>>
>>>> I guess one way to do it would be to alias the same page in two places,
>>>> one for use by vhost and one for use by everything else.  However, the
>>>> kernel does not provide the means to do this kind of aliasing for
>>>> anonymous mmaps.
>>>
>>> Basically pages go away on munmap, so won't simple
>>> 	lock
>>> 	munmap
>>> 	mmap(MAP_NORESERVE)
>>> 	unlock
>>> do the trick?
>>
>> Not sure I follow.  Here we have this:
>>
>>     VCPU 1                             VCPU 2                          I/O worker
>>     ----------------------------------------------------------------------------------------
>>     take big QEMU lock
>>     p = address_space_map(hva, len)
>>     pass I/O request to worker thread
>>                                                                        read(fd, p, len)
>>     release big QEMU lock
>>
>>                                         memory_region_del_subregion
>>                                           mmap(MAP_NORESERVE)
>>
>>                                                                        read returns EFAULT
> 
> Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
> Doesn't memory_region_del_subregion free the memory?

No, only destruction of the memory region frees it.  address_space_map
takes a reference to the memory region and address_space_unmap releases it.

Paolo

>>                                                                        wake up VCPU 1
>>     take big QEMU lock
>>     EFAULT?  What's that?
>>
>> In another scenario you are less lucky: the memory accesses
>> between address_space_map/unmap aren't done in the kernel and
>> you get a plain old SIGSEGV.
>>
>> This is not something that you can fix with a lock.  The very
>> purpose of the map/unmap API is to do stuff asynchronously while
>> the lock is released.
>>
>> Thanks,
>>
>> Paolo
Michael S. Tsirkin June 19, 2015, 8:05 a.m. UTC | #43
On Fri, Jun 19, 2015 at 09:57:22AM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 09:56, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
> >>>> However, with Igor's patches a memory_region_del_subregion will cause a
> >>>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> >>>>
> >>>> I guess one way to do it would be to alias the same page in two places,
> >>>> one for use by vhost and one for use by everything else.  However, the
> >>>> kernel does not provide the means to do this kind of aliasing for
> >>>> anonymous mmaps.
> >>>
> >>> Basically pages go away on munmap, so won't simple
> >>> 	lock
> >>> 	munmap
> >>> 	mmap(MAP_NORESERVE)
> >>> 	unlock
> >>> do the trick?
> >>
> >> Not sure I follow.  Here we have this:
> >>
> >>     VCPU 1                             VCPU 2                          I/O worker
> >>     ----------------------------------------------------------------------------------------
> >>     take big QEMU lock
> >>     p = address_space_map(hva, len)
> >>     pass I/O request to worker thread
> >>                                                                        read(fd, p, len)
> >>     release big QEMU lock
> >>
> >>                                         memory_region_del_subregion
> >>                                           mmap(MAP_NORESERVE)
> >>
> >>                                                                        read returns EFAULT
> > 
> > Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
> > Doesn't memory_region_del_subregion free the memory?
> 
> No, only destruction of the memory region frees it.  address_space_map
> takes a reference to the memory region and address_space_unmap releases it.
> 
> Paolo

Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
after we detect refcount is 0?

> >>                                                                        wake up VCPU 1
> >>     take big QEMU lock
> >>     EFAULT?  What's that?
> >>
> >> In another scenario you are less lucky: the memory accesses
> >> between address_space_map/unmap aren't done in the kernel and
> >> you get a plain old SIGSEGV.
> >>
> >> This is not something that you can fix with a lock.  The very
> >> purpose of the map/unmap API is to do stuff asynchronously while
> >> the lock is released.
> >>
> >> Thanks,
> >>
> >> Paolo
Paolo Bonzini June 19, 2015, 8:52 a.m. UTC | #44
On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > No, only destruction of the memory region frees it.  address_space_map
> > takes a reference to the memory region and address_space_unmap releases it.
> > 
> > Paolo
> 
> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> after we detect refcount is 0?

No, because in the meantime another DIMM could have been hotplugged
at the same place where the old one was.  This is legal:

    user                      guest                 QEMU
    ----------------------------------------------------------------------------------------
                              start I/O
                                  '---------------> address_space_map
    device_del
        '-------------------> receives SCI
                              executes _EJ0
                                  '---------------> memory_region_del_subregion
                                                    object_unparent
    device_add
        '-----------------------------------------> device_set_realized
                                                      hotplug_handler_plug
                                                        pc_machine_device_plug_cb
                                                          pc_dimm_plug
                                                            memory_region_add_subregion

                                                    I/O finishes
                                                      address_space_unmap

Surprise removal could similarly be done in QEMU, but it will hold on to
some resources for as long as the device backends need them.

Paolo
Michael S. Tsirkin June 19, 2015, 10:14 a.m. UTC | #45
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > > No, only destruction of the memory region frees it.  address_space_map
> > > takes a reference to the memory region and address_space_unmap releases it.
> > > 
> > > Paolo
> > 
> > Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> > after we detect refcount is 0?
> 
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was.  This is legal:
> 
>     user                      guest                 QEMU
>     ----------------------------------------------------------------------------------------
>                               start I/O
>                                   '---------------> address_space_map
>     device_del
>         '-------------------> receives SCI
>                               executes _EJ0
>                                   '---------------> memory_region_del_subregion
>                                                     object_unparent

So guest started DMA into memory, then ejected this memory while DMA
is in progress?

>     device_add
>         '-----------------------------------------> device_set_realized
>                                                       hotplug_handler_plug
>                                                         pc_machine_device_plug_cb
>                                                           pc_dimm_plug
>                                                             memory_region_add_subregion
> 
>                                                     I/O finishes
>                                                       address_space_unmap
> 
> Surprise removal similarly could be done in QEMU, but it will hold to
> some resources for as long as the device backends need them.
> 
> Paolo
Paolo Bonzini June 19, 2015, 10:44 a.m. UTC | #46
On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
>>>> No, only destruction of the memory region frees it.  address_space_map
>>>> takes a reference to the memory region and address_space_unmap releases it.
>>>>
>>>> Paolo
>>>
>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
>>> after we detect refcount is 0?
>>
>> No, because in the meanwhile another DIMM could have been hotplugged
>> at the same place where the old one was.  This is legal:
>>
>>     user                      guest                 QEMU
>>     ----------------------------------------------------------------------------------------
>>                               start I/O
>>                                   '---------------> address_space_map
>>     device_del
>>         '-------------------> receives SCI
>>                               executes _EJ0
>>                                   '---------------> memory_region_del_subregion
>>                                                     object_unparent
> 
> So guest started DMA into memory, then ejected this memory while DMA
> is in progress?

Yes.  There is nothing that forbids doing that.

Paolo

>>     device_add
>>         '-----------------------------------------> device_set_realized
>>                                                       hotplug_handler_plug
>>                                                         pc_machine_device_plug_cb
>>                                                           pc_dimm_plug
>>                                                             memory_region_add_subregion
>>
>>                                                     I/O finishes
>>                                                       address_space_unmap
>>
>> Surprise removal similarly could be done in QEMU, but it will hold to
>> some resources for as long as the device backends need them.
>>
>> Paolo
Michael S. Tsirkin June 19, 2015, 1:34 p.m. UTC | #47
On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> >>>> No, only destruction of the memory region frees it.  address_space_map
> >>>> takes a reference to the memory region and address_space_unmap releases it.
> >>>>
> >>>> Paolo
> >>>
> >>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> >>> after we detect refcount is 0?
> >>
> >> No, because in the meanwhile another DIMM could have been hotplugged
> >> at the same place where the old one was.  This is legal:
> >>
> >>     user                      guest                 QEMU
> >>     ----------------------------------------------------------------------------------------
> >>                               start I/O
> >>                                   '---------------> address_space_map
> >>     device_del
> >>         '-------------------> receives SCI
> >>                               executes _EJ0
> >>                                   '---------------> memory_region_del_subregion
> >>                                                     object_unparent
> > 
> > So guest started DMA into memory, then ejected this memory while DMA
> > is in progress?
> 
> Yes.  There is nothing that forbids doing that.
> 
> Paolo

Can we simply defer the next device_add using a hva until all IO completes?

> >>     device_add
> >>         '-----------------------------------------> device_set_realized
> >>                                                       hotplug_handler_plug
> >>                                                         pc_machine_device_plug_cb
> >>                                                           pc_dimm_plug
> >>                                                             memory_region_add_subregion
> >>
> >>                                                     I/O finishes
> >>                                                       address_space_unmap
> >>
> >> Surprise removal similarly could be done in QEMU, but it will hold to
> >> some resources for as long as the device backends need them.
> >>
> >> Paolo
Paolo Bonzini June 19, 2015, 3:19 p.m. UTC | #48
On 19/06/2015 15:34, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
>>> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
>>>>>> No, only destruction of the memory region frees it.  address_space_map
>>>>>> takes a reference to the memory region and address_space_unmap releases it.
>>>>>>
>>>>>> Paolo
>>>>>
>>>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
>>>>> after we detect refcount is 0?
>>>>
>>>> No, because in the meanwhile another DIMM could have been hotplugged
>>>> at the same place where the old one was.  This is legal:
>>>>
>>>>     user                      guest                 QEMU
>>>>     ----------------------------------------------------------------------------------------
>>>>                               start I/O
>>>>                                   '---------------> address_space_map
>>>>     device_del
>>>>         '-------------------> receives SCI
>>>>                               executes _EJ0
>>>>                                   '---------------> memory_region_del_subregion
>>>>                                                     object_unparent
>>>
>>> So guest started DMA into memory, then ejected this memory while DMA
>>> is in progress?
>>
>> Yes.  There is nothing that forbids doing that.
> 
> Can we simply defer the next device_add using a hva until all IO completes?

We could, but I/O is just an example.  It can be I/O, a network ring,
whatever.  We cannot audit all address_space_map uses.

Paolo
Michael S. Tsirkin June 19, 2015, 4:20 p.m. UTC | #49
On Fri, Jun 19, 2015 at 05:19:44PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 15:34, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> >>> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> >>>>>> No, only destruction of the memory region frees it.  address_space_map
> >>>>>> takes a reference to the memory region and address_space_unmap releases it.
> >>>>>>
> >>>>>> Paolo
> >>>>>
> >>>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> >>>>> after we detect refcount is 0?
> >>>>
> >>>> No, because in the meanwhile another DIMM could have been hotplugged
> >>>> at the same place where the old one was.  This is legal:
> >>>>
> >>>>     user                      guest                 QEMU
> >>>>     ----------------------------------------------------------------------------------------
> >>>>                               start I/O
> >>>>                                   '---------------> address_space_map
> >>>>     device_del
> >>>>         '-------------------> receives SCI
> >>>>                               executes _EJ0
> >>>>                                   '---------------> memory_region_del_subregion
> >>>>                                                     object_unparent
> >>>
> >>> So guest started DMA into memory, then ejected this memory while DMA
> >>> is in progress?
> >>
> >> Yes.  There is nothing that forbids doing that.
> > 
> > Can we simply defer the next device_add using a hva until all IO completes?
> 
> We could, but I/O is just an example.  It can be I/O, a network ring,
> whatever.  We cannot audit all address_space_map uses.
> 
> Paolo

No need to audit them all: defer a device_add that reuses an hva range
until address_space_unmap drops the reference count on hvas in that
range to 0.
Paolo Bonzini June 19, 2015, 4:26 p.m. UTC | #50
On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > We could, but I/O is just an example.  It can be I/O, a network ring,
> > whatever.  We cannot audit all address_space_map uses.
> > 
>
> No need to audit them all: defer device_add using an hva range until
> address_space_unmap drops using hvas in range drops reference count to
> 0.

That could be forever.  You certainly don't want to lock up the monitor
forever just because a device model isn't too friendly to memory hot-unplug.

That's why you need to audit them (also, it's perfectly in the device
model's right to use address_space_unmap this way: it's the guest that's
buggy and leaves a dangling reference to a region before unplugging it).

Paolo
Michael S. Tsirkin June 19, 2015, 4:33 p.m. UTC | #51
On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > > We could, but I/O is just an example.  It can be I/O, a network ring,
> > > whatever.  We cannot audit all address_space_map uses.
> > > 
> >
> > No need to audit them all: defer device_add using an hva range until
> > address_space_unmap drops using hvas in range drops reference count to
> > 0.
> 
> That could be forever.  You certainly don't want to lockup the monitor
> forever just because a device model isn't too friendly to memory hot-unplug.

We can defer the addition; no need to lock up the monitor.

> That's why you need to audit them (also, it's perfectly in the device
> model's right to use address_space_unmap this way: it's the guest that's
> buggy and leaves a dangling reference to a region before unplugging it).
> 
> Paolo

Then maybe it's not too bad that the guest will crash because the memory
was unmapped.
Paolo Bonzini June 19, 2015, 4:44 p.m. UTC | #52
On 19/06/2015 18:33, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 18:20, Michael S. Tsirkin wrote:
>>>> We could, but I/O is just an example.  It can be I/O, a network ring,
>>>> whatever.  We cannot audit all address_space_map uses.
>>>>
>>>
>>> No need to audit them all: defer device_add using an hva range until
>>> address_space_unmap drops using hvas in range drops reference count to
>>> 0.
>>
>> That could be forever.  You certainly don't want to lockup the monitor
>> forever just because a device model isn't too friendly to memory hot-unplug.
> 
> We can defer the addition, no need to lockup the monitor.

Patches are welcome.

>> That's why you need to audit them (also, it's perfectly in the device
>> model's right to use address_space_unmap this way: it's the guest that's
>> buggy and leaves a dangling reference to a region before unplugging it).
> 
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.

That's a matter of taste.  I strongly prefer using 12K extra memory per
VCPU to a guest crash.

Paolo
Michael S. Tsirkin June 19, 2015, 4:45 p.m. UTC | #53
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> 
> 
> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > > No, only destruction of the memory region frees it.  address_space_map
> > > takes a reference to the memory region and address_space_unmap releases it.
> > > 
> > > Paolo
> > 
> > Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> > after we detect refcount is 0?
> 
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was.  This is legal:
> 
>     user                      guest                 QEMU
>     ----------------------------------------------------------------------------------------
>                               start I/O
>                                   '---------------> address_space_map
>     device_del
>         '-------------------> receives SCI
>                               executes _EJ0
>                                   '---------------> memory_region_del_subregion
>                                                     object_unparent
>     device_add
>         '-----------------------------------------> device_set_realized
>                                                       hotplug_handler_plug
>                                                         pc_machine_device_plug_cb
>                                                           pc_dimm_plug
>                                                             memory_region_add_subregion
> 
>                                                     I/O finishes
>                                                       address_space_unmap
> 
> Surprise removal similarly could be done in QEMU, but it will hold on to
> some resources for as long as the device backends need them.
> 
> Paolo

OK so what's the problem with checking for this condition,
after address_space_unmap detects that ref count is 0
and before calling mmap(MAP_NORESERVE)?

At that point we are off the data path so we can take locks.
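
To make that concrete, here is a minimal sketch of the unmap-side idea --
the names and structures are illustrative, not QEMU's actual internals,
and it assumes the region has already been disconnected from the guest by
the time the last user drops its reference:

/*
 * Illustrative only: once the last in-flight user of a hot-removed
 * region is gone, overlay the stale HVA range with a fresh anonymous
 * MAP_NORESERVE mapping so its backing memory is released.
 */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

struct hva_range {
    void   *hva;        /* start of the host-virtual range           */
    size_t  len;        /* length of the range                       */
    int     refcount;   /* in-flight address_space_map()-style users */
    bool    removed;    /* guest has already ejected the DIMM        */
};

/* Called from the unmap path, off the data path, under a lock. */
static void hva_range_unref(struct hva_range *r)
{
    if (--r->refcount > 0)
        return;

    if (r->removed) {
        /*
         * No users left and the guest no longer sees the region:
         * replace the old mapping in place, dropping the backing
         * pages, so a later device_add can reuse the slot.
         */
        mmap(r->hva, r->len, PROT_NONE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
             -1, 0);
    }
}

Whether the later device_add also has to wait for this point is the part
the sketch leaves open.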
Paolo Bonzini June 19, 2015, 4:50 p.m. UTC | #54
On 19/06/2015 18:45, Michael S. Tsirkin wrote:
> >     user                      guest                 QEMU
> >     ----------------------------------------------------------------------------------------
> >                               start I/O
> >                                   '---------------> address_space_map
> >     device_del
> >         '-------------------> receives SCI
> >                               executes _EJ0
> >                                   '---------------> memory_region_del_subregion
> >                                                     object_unparent
> >     device_add
> >         '-----------------------------------------> device_set_realized
> >                                                       hotplug_handler_plug
> >                                                         pc_machine_device_plug_cb
> >                                                           pc_dimm_plug
> >                                                             memory_region_add_subregion
> > 
> >                                                     I/O finishes
> >                                                       address_space_unmap
> > 
> 
> OK so what's the problem with checking for this condition,
> after address_space_unmap detects that ref count is 0
> and before calling mmap(MAP_NORESERVE)?

Should be okay.  However, someone still needs to figure out the part
where device_add is delayed until the address_space_unmap.  Otherwise,
the device_add's mmap() overlays the older one and the delayed
mmap(MAP_NORESERVE) overlays the device_add's mmap().

It seems like a colossal waste of time to perfect a little-used path
that no doubt will have bugs in it, some of which could perhaps have
security consequences.

Also, QEMU is not the only vhost client, so we're just scratching the
surface here.

Paolo
Igor Mammedov June 22, 2015, 7:10 a.m. UTC | #55
On Fri, 19 Jun 2015 18:33:39 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > > > We could, but I/O is just an example.  It can be I/O, a network ring,
> > > > whatever.  We cannot audit all address_space_map uses.
> > > > 
> > >
> > > No need to audit them all: defer a device_add that uses an hva range
> > > until address_space_unmap on hvas in that range drops the reference
> > > count to 0.
> > 
> > That could be forever.  You certainly don't want to lock up the monitor
> > forever just because a device model isn't too friendly to memory hot-unplug.
> 
> We can defer the addition, no need to lock up the monitor.
> 
> > That's why you need to audit them (also, it's perfectly in the device
> > model's right to use address_space_unmap this way: it's the guest that's
> > buggy and leaves a dangling reference to a region before unplugging it).
> > 
> > Paolo
> 
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.
So far the HVA approach is unusable even if we make this assumption and
let the guest crash: virtio-net doesn't work with it anyway.  Translation
of GPA to HVA for the descriptors works as expected (correctly), but
virtio backed by the vhost+HVA hack still can't send/receive packets.

That's why I prefer to merge the kernel solution first, as a stable fix
that doesn't introduce any new issues, and work on the userspace approach
on top of that.

Hopefully that can be done, but we would still need time to iron out the
side effects/issues it causes or could cause before the fix becomes
stable enough for production.

Paolo Bonzini June 22, 2015, 9:45 a.m. UTC | #56
On 22/06/2015 09:10, Igor Mammedov wrote:
> So far the HVA approach is unusable even if we make this assumption and
> let the guest crash: virtio-net doesn't work with it anyway.  Translation
> of GPA to HVA for the descriptors works as expected (correctly), but
> virtio backed by the vhost+HVA hack still can't send/receive packets.
> 
> That's why I prefer to merge the kernel solution first, as a stable fix
> that doesn't introduce any new issues, and work on the userspace approach
> on top of that.

Also, let's do some math.

Let's assume 3 network devices per VM, one vhost device per queue, one
queue per VCPU per network device.  Let's assume the host is
overcommitted 3:1.

Thus we have 3*3=9 times as many vhost devices as we have physical CPUs.

At 12K each, we're thus talking about 108K per physical CPU.

From a relative point of view, and assuming 1 GB of memory per physical
CPU (a pretty low amount if you're overcommitting CPU 3:1), this is 0.01%
of the total memory.

From an absolute point of view, it takes a system with 60 physical CPUs
to reach the same memory usage as the vmlinuz binary of a typical distro
kernel (not counting the modules).
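
For anyone who wants to re-check the numbers, a throwaway snippet (taking
the ~12K-per-vhost-device figure implied by the 108K above):

#include <stdio.h>

int main(void)
{
    const double per_device_kb   = 12.0;          /* ~12K of vhost state per device (assumed) */
    const int    devices_per_cpu = 3 * 3;         /* 3 NICs/VM, 1 queue/VCPU, 3:1 overcommit  */
    const double per_cpu_kb      = per_device_kb * devices_per_cpu;
    const double ram_per_cpu_kb  = 1024.0 * 1024; /* 1 GB of RAM per physical CPU             */

    /* Prints: 108K per physical CPU, 0.01% of its RAM */
    printf("%.0fK per physical CPU, %.2f%% of its RAM\n",
           per_cpu_kb, 100.0 * per_cpu_kb / ram_per_cpu_kb);
    return 0;
}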

Paolo

> Hopefully that can be done, but we would still need time to iron out the
> side effects/issues it causes or could cause before the fix becomes
> stable enough for production.
> 

Patch

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@ 
 #include "vhost.h"
 
 enum {
-	VHOST_MEMORY_MAX_NREGIONS = 64,
+	VHOST_MEMORY_MAX_NREGIONS = 509,
 	VHOST_MEMORY_F_LOG = 0x1,
 };
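
For completeness, a simplified, from-memory sketch of where this constant
bites -- the bound check on the set-memory-table path.  This is not a
verbatim copy of the kernel's vhost_set_memory(), just the shape of it:

/* Illustrative only: a simplified stand-in for the kernel's check,
 * showing why QEMU's set-memory-table ioctl starts failing once the
 * guest has more regions than VHOST_MEMORY_MAX_NREGIONS. */
#include <errno.h>

enum { VHOST_MEMORY_MAX_NREGIONS = 509 };

struct vhost_memory_hdr {          /* stand-in for the uapi header  */
    unsigned int nregions;         /* number of regions that follow */
    unsigned int padding;
};

static long check_mem_table(const struct vhost_memory_hdr *mem)
{
    if (mem->nregions > VHOST_MEMORY_MAX_NREGIONS)
        return -E2BIG;             /* userspace sees the ioctl fail */
    return 0;
}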