diff mbox series

Can we boot a 512U kvm guest?

Message ID 86aa9609-7dc9-1461-ae47-f50897cd0875@huawei.com (mailing list archive)
State New, archived
Headers show
Series Can we boot a 512U kvm guest? | expand

Commit Message

Zenghui Yu Aug. 13, 2019, 8:50 a.m. UTC
Hi folks,

Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
512"), we seemed to be allowed to boot a 512U guest.  But I failed to
start it up with the latest QEMU.  I guess there are at least *two*
reasons (limitations).

First I got a QEMU abort:
	"kvm_set_irq: Invalid argument"

Enable the trace_kvm_irq_line() under debugfs, when it comed with
vcpu-256, I got:
	"Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...

So the thing is that we only have 8 bits for vcpu_index field ([23:16])
in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
PPI to vcpu-256, whose vcpu_index will take 9 bits.

I temporarily patched the KVM and QEMU with the following diff:

---8<---
---8<---

It makes things a bit better, it also immediately BREAKs the api with
old versions.


Next comes one more QEMU abort (with the "fix" above):
	"Failed to set device address: No space left on device"

We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
each redistributor. 512 vcpus take 1024 io devices, which is beyond the
maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
So we get a ENOSPC error here.


I don't know if the similar problems have been discussed before in ML.
Is it time to really support the 512U guest?


Thanks,
zenghui

Comments

Eric Auger Aug. 13, 2019, 9:44 p.m. UTC | #1
Hi,

On 8/13/19 4:17 PM, Marc Zyngier wrote:
> On Tue, 13 Aug 2019 09:50:27 +0100,
> Zenghui Yu <yuzenghui@huawei.com> wrote:
> 
> Hi Zenghui,
> 
>>
>> Hi folks,
>>
>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>> start it up with the latest QEMU.  I guess there are at least *two*
>> reasons (limitations).
>>
>> First I got a QEMU abort:
>> 	"kvm_set_irq: Invalid argument"
>>
>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>> vcpu-256, I got:
>> 	"Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>
>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
> 
> Irk. Not great indeed. Clearly, we have a couple of holes in the way
> we test these ABI changes (/me eyes Eric...).

My bad. At that time I was able to test with 224 vcpus with QEMU (256
vcpus were tested as well) and I failed to see those other limitations,
thinking the only limitation left was the number of redistributor
regions we could register.
> 
>>
>> I temporarily patched the KVM and QEMU with the following diff:
>>
>> ---8<---
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 95516a4..39a0fb1 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>>
>>  /* KVM_IRQ_LINE irq field index values */
>> -#define KVM_ARM_IRQ_TYPE_SHIFT		24
>> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
>> +#define KVM_ARM_IRQ_TYPE_SHIFT		28
>> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>>  #define KVM_ARM_IRQ_VCPU_SHIFT		16
>> -#define KVM_ARM_IRQ_VCPU_MASK		0xff
>> +#define KVM_ARM_IRQ_VCPU_MASK		0xfff
>>  #define KVM_ARM_IRQ_NUM_SHIFT		0
>>  #define KVM_ARM_IRQ_NUM_MASK		0xffff
>>
>> ---8<---
>>
>> It makes things a bit better, it also immediately BREAKs the api with
>> old versions.
> 
> Yes, and we can't have that (specially if you consider that this API
> is shared between 32 and 64bit). One "get out of jail card" is to
> steal a few bits from the top of the word, and encode things there:
> 
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index 4602464ebdfb..86db092e4c2f 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -254,8 +254,10 @@ struct kvm_vcpu_events {
>  #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
>  
>  /* KVM_IRQ_LINE irq field index values */
> +#define KVM_ARM_IRQ_VCPU2_SHIFT		28
> +#define KVM_ARM_IRQ_VCPU2_MASK		0xf
>  #define KVM_ARM_IRQ_TYPE_SHIFT		24
> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>  #define KVM_ARM_IRQ_VCPU_SHIFT		16
>  #define KVM_ARM_IRQ_VCPU_MASK		0xff
>  #define KVM_ARM_IRQ_NUM_SHIFT		0
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 7b7ac0f6cec9..44cb25bfc95e 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -308,8 +308,10 @@ struct kvm_vcpu_events {
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>  
>  /* KVM_IRQ_LINE irq field index values */
> +#define KVM_ARM_IRQ_VCPU2_SHIFT		28
> +#define KVM_ARM_IRQ_VCPU2_MASK		0xf
>  #define KVM_ARM_IRQ_TYPE_SHIFT		24
> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>  #define KVM_ARM_IRQ_VCPU_SHIFT		16
>  #define KVM_ARM_IRQ_VCPU_MASK		0xff
>  #define KVM_ARM_IRQ_NUM_SHIFT		0
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 90cedebaeb94..fb685c1c0514 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -889,6 +889,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
>  
>  	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
>  	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
> +	vcpu_idx += ((irq >> KVM_ARM_IRQ_VCPU2_SHIFT) & KVM_ARM_IRQ_VCPU2_MASK) * (KVM_ARM_IRQ_VCPU_MASK + 1);
>  	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;
>  
>  	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);
> 
> It should work because we've been careful not to allow value outside
> of {0, 1, 2} for irq_type. I don't like it, but I really don't feel
> like adding another IRQ related ioctl. We still have 16 irq types
> (which is already a waste of space), and we can go up to 4096 vcpu.

> 
> Peter, what do you think?
> 
>> Next comes one more QEMU abort (with the "fix" above):
>> 	"Failed to set device address: No space left on device"
>>
>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>> So we get a ENOSPC error here.
> 
> I can reproduce that issue here ("499 vcpus on my Chromebook,
> baby"). Not an ABI problem though, and we can bump it up if that's
> needed.
> 
>> I don't know if the similar problems have been discussed before in ML.
>> Is it time to really support the 512U guest?
> 
> The real question is "why the hell would you want to do that?" ;-)
> Seriously, I'm very interested in finding out what is the use case for
> these gigantic VMs, other than debugging the kernel for big machines.

The initial goal was to keep up with the HPC physical cpu count increase
and allow similar numbers as x86 supports.

Thanks

Eric (with limited internet access atm)

> 
> Thanks,
> 
> 	M.
>
Zenghui Yu Aug. 14, 2019, 6:51 a.m. UTC | #2
On 2019/8/13 22:17, Marc Zyngier wrote:
> On Tue, 13 Aug 2019 09:50:27 +0100,
> Zenghui Yu <yuzenghui@huawei.com> wrote:
> 
> Hi Zenghui,
> 
>>
>> Hi folks,
>>
>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>> start it up with the latest QEMU.  I guess there are at least *two*
>> reasons (limitations).
>>
>> First I got a QEMU abort:
>> 	"kvm_set_irq: Invalid argument"
>>
>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>> vcpu-256, I got:
>> 	"Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>
>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
> 
> Irk. Not great indeed. Clearly, we have a couple of holes in the way
> we test these ABI changes (/me eyes Eric...).
> 
>>
>> I temporarily patched the KVM and QEMU with the following diff:
>>
>> ---8<---
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 95516a4..39a0fb1 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>   #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>>
>>   /* KVM_IRQ_LINE irq field index values */
>> -#define KVM_ARM_IRQ_TYPE_SHIFT		24
>> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
>> +#define KVM_ARM_IRQ_TYPE_SHIFT		28
>> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>>   #define KVM_ARM_IRQ_VCPU_SHIFT		16
>> -#define KVM_ARM_IRQ_VCPU_MASK		0xff
>> +#define KVM_ARM_IRQ_VCPU_MASK		0xfff
>>   #define KVM_ARM_IRQ_NUM_SHIFT		0
>>   #define KVM_ARM_IRQ_NUM_MASK		0xffff
>>
>> ---8<---
>>
>> It makes things a bit better, it also immediately BREAKs the api with
>> old versions.
> 
> Yes, and we can't have that (specially if you consider that this API
> is shared between 32 and 64bit). One "get out of jail card" is to
> steal a few bits from the top of the word, and encode things there:
> 
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index 4602464ebdfb..86db092e4c2f 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -254,8 +254,10 @@ struct kvm_vcpu_events {
>   #define   KVM_DEV_ARM_ITS_CTRL_RESET		4
>   
>   /* KVM_IRQ_LINE irq field index values */
> +#define KVM_ARM_IRQ_VCPU2_SHIFT		28
> +#define KVM_ARM_IRQ_VCPU2_MASK		0xf
>   #define KVM_ARM_IRQ_TYPE_SHIFT		24
> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>   #define KVM_ARM_IRQ_VCPU_SHIFT		16
>   #define KVM_ARM_IRQ_VCPU_MASK		0xff
>   #define KVM_ARM_IRQ_NUM_SHIFT		0
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 7b7ac0f6cec9..44cb25bfc95e 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -308,8 +308,10 @@ struct kvm_vcpu_events {
>   #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>   
>   /* KVM_IRQ_LINE irq field index values */
> +#define KVM_ARM_IRQ_VCPU2_SHIFT		28
> +#define KVM_ARM_IRQ_VCPU2_MASK		0xf
>   #define KVM_ARM_IRQ_TYPE_SHIFT		24
> -#define KVM_ARM_IRQ_TYPE_MASK		0xff
> +#define KVM_ARM_IRQ_TYPE_MASK		0xf
>   #define KVM_ARM_IRQ_VCPU_SHIFT		16
>   #define KVM_ARM_IRQ_VCPU_MASK		0xff
>   #define KVM_ARM_IRQ_NUM_SHIFT		0
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 90cedebaeb94..fb685c1c0514 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -889,6 +889,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
>   
>   	irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK;
>   	vcpu_idx = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK;
> +	vcpu_idx += ((irq >> KVM_ARM_IRQ_VCPU2_SHIFT) & KVM_ARM_IRQ_VCPU2_MASK) * (KVM_ARM_IRQ_VCPU_MASK + 1);
>   	irq_num = (irq >> KVM_ARM_IRQ_NUM_SHIFT) & KVM_ARM_IRQ_NUM_MASK;
>   
>   	trace_kvm_irq_line(irq_type, vcpu_idx, irq_num, irq_level->level);
> 
> It should work because we've been careful not to allow value outside
> of {0, 1, 2} for irq_type. I don't like it, but I really don't feel
> like adding another IRQ related ioctl. We still have 16 irq types
> (which is already a waste of space), and we can go up to 4096 vcpu.

Hi Marc,

Thanks for this. I can test it on my machine once user-space agrees with
this change.

> Peter, what do you think?
> 
>> Next comes one more QEMU abort (with the "fix" above):
>> 	"Failed to set device address: No space left on device"
>>
>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>> So we get a ENOSPC error here.
> 
> I can reproduce that issue here ("499 vcpus on my Chromebook,
> baby"). Not an ABI problem though, and we can bump it up if that's
> needed.
> 
>> I don't know if the similar problems have been discussed before in ML.
>> Is it time to really support the 512U guest?
> 
> The real question is "why the hell would you want to do that?" ;-)
> Seriously, I'm very interested in finding out what is the use case for
> these gigantic VMs, other than debugging the kernel for big machines.

To be honest, I haven't seen any use case yet. In a typical production
environment, *I think* we will not let the guest to have more CPUs than
the host (and currently the maximum we have for host is 128).

(It's just that someone complained that "why can't I start a VM with the
  max supported number of vcpus", and I started investigating yesterday.)


Thanks,
Zenghui
Eric Auger Aug. 22, 2019, 9:08 a.m. UTC | #3
Hi Zenghui,

On 8/13/19 10:50 AM, Zenghui Yu wrote:
> Hi folks,
> 
> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
> start it up with the latest QEMU.  I guess there are at least *two*
> reasons (limitations).
> 
> First I got a QEMU abort:
>     "kvm_set_irq: Invalid argument"
> 
> Enable the trace_kvm_irq_line() under debugfs, when it comed with
> vcpu-256, I got:
>     "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
> 
> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
> PPI to vcpu-256, whose vcpu_index will take 9 bits.
> 
> I temporarily patched the KVM and QEMU with the following diff:
> 
> ---8<---
> diff --git a/arch/arm64/include/uapi/asm/kvm.h
> b/arch/arm64/include/uapi/asm/kvm.h
> index 95516a4..39a0fb1 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
> 
>  /* KVM_IRQ_LINE irq field index values */
> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>  #define KVM_ARM_IRQ_VCPU_SHIFT        16
> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>  #define KVM_ARM_IRQ_NUM_SHIFT        0
>  #define KVM_ARM_IRQ_NUM_MASK        0xffff
> 
> ---8<---
> 
> It makes things a bit better, it also immediately BREAKs the api with
> old versions.
> 
> 
> Next comes one more QEMU abort (with the "fix" above):
>     "Failed to set device address: No space left on device"
> 
> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
> So we get a ENOSPC error here.

Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
I can do it.

You mentioned you started working on the QEMU fix. Do you plan to submit
patches to the ML or do you want me to attempt to fix the situation on
userspace?

Thanks

Eric
> 
> 
> I don't know if the similar problems have been discussed before in ML.
> Is it time to really support the 512U guest?
> 
> 
> Thanks,
> zenghui
>
Marc Zyngier Aug. 22, 2019, 9:29 a.m. UTC | #4
Hi Eric,

On 22/08/2019 10:08, Auger Eric wrote:
> Hi Zenghui,
> 
> On 8/13/19 10:50 AM, Zenghui Yu wrote:
>> Hi folks,
>>
>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>> start it up with the latest QEMU.  I guess there are at least *two*
>> reasons (limitations).
>>
>> First I got a QEMU abort:
>>     "kvm_set_irq: Invalid argument"
>>
>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>> vcpu-256, I got:
>>     "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>
>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
>>
>> I temporarily patched the KVM and QEMU with the following diff:
>>
>> ---8<---
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 95516a4..39a0fb1 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
>>
>>  /* KVM_IRQ_LINE irq field index values */
>> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
>> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
>> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
>> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>>  #define KVM_ARM_IRQ_VCPU_SHIFT        16
>> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
>> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>>  #define KVM_ARM_IRQ_NUM_SHIFT        0
>>  #define KVM_ARM_IRQ_NUM_MASK        0xffff
>>
>> ---8<---
>>
>> It makes things a bit better, it also immediately BREAKs the api with
>> old versions.
>>
>>
>> Next comes one more QEMU abort (with the "fix" above):
>>     "Failed to set device address: No space left on device"
>>
>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>> So we get a ENOSPC error here.
> 
> Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
> I can do it.

I really wonder whether that's a sensible thing to do on its own.

Looking at the implementation of kvm_io_bus_register_dev (which copies
the whole array each time we insert a device), we have an obvious issue
with systems that create a large number of device structures, leading to
large transient memory usage and slow guest start.

We could also try and reduce the number of devices we insert by making
the redistributor a single device (which it is in reality). It probably
means we need to make the MMIO decoding more flexible.

Thanks,

	M.
Eric Auger Aug. 22, 2019, 9:50 a.m. UTC | #5
Hi Marc,

On 8/22/19 11:29 AM, Marc Zyngier wrote:
> Hi Eric,
> 
> On 22/08/2019 10:08, Auger Eric wrote:
>> Hi Zenghui,
>>
>> On 8/13/19 10:50 AM, Zenghui Yu wrote:
>>> Hi folks,
>>>
>>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>>> start it up with the latest QEMU.  I guess there are at least *two*
>>> reasons (limitations).
>>>
>>> First I got a QEMU abort:
>>>     "kvm_set_irq: Invalid argument"
>>>
>>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>>> vcpu-256, I got:
>>>     "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>>
>>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
>>>
>>> I temporarily patched the KVM and QEMU with the following diff:
>>>
>>> ---8<---
>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>> b/arch/arm64/include/uapi/asm/kvm.h
>>> index 95516a4..39a0fb1 100644
>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
>>>
>>>  /* KVM_IRQ_LINE irq field index values */
>>> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
>>> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
>>> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
>>> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>>>  #define KVM_ARM_IRQ_VCPU_SHIFT        16
>>> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
>>> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>>>  #define KVM_ARM_IRQ_NUM_SHIFT        0
>>>  #define KVM_ARM_IRQ_NUM_MASK        0xffff
>>>
>>> ---8<---
>>>
>>> It makes things a bit better, it also immediately BREAKs the api with
>>> old versions.
>>>
>>>
>>> Next comes one more QEMU abort (with the "fix" above):
>>>     "Failed to set device address: No space left on device"
>>>
>>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>>> So we get a ENOSPC error here.
>>
>> Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
>> I can do it.
> 
> I really wonder whether that's a sensible thing to do on its own.
> 
> Looking at the implementation of kvm_io_bus_register_dev (which copies
> the whole array each time we insert a device), we have an obvious issue
> with systems that create a large number of device structures, leading to
> large transient memory usage and slow guest start.
> 
> We could also try and reduce the number of devices we insert by making
> the redistributor a single device (which it is in reality). It probably
> means we need to make the MMIO decoding more flexible.

Yes it makes sense. If no objection, I can work on this as I am the
source of the mess ;-)

Thanks

Eric
> 
> Thanks,
> 
> 	M.
>
Marc Zyngier Aug. 22, 2019, 10:23 a.m. UTC | #6
On 22/08/2019 10:50, Auger Eric wrote:
> Hi Marc,
> 
> On 8/22/19 11:29 AM, Marc Zyngier wrote:
>> Hi Eric,
>>
>> On 22/08/2019 10:08, Auger Eric wrote:
>>> Hi Zenghui,
>>>
>>> On 8/13/19 10:50 AM, Zenghui Yu wrote:
>>>> Hi folks,
>>>>
>>>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>>>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>>>> start it up with the latest QEMU.  I guess there are at least *two*
>>>> reasons (limitations).
>>>>
>>>> First I got a QEMU abort:
>>>>     "kvm_set_irq: Invalid argument"
>>>>
>>>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>>>> vcpu-256, I got:
>>>>     "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>>>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>>>
>>>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>>>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>>>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
>>>>
>>>> I temporarily patched the KVM and QEMU with the following diff:
>>>>
>>>> ---8<---
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>>> b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 95516a4..39a0fb1 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>>>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
>>>>
>>>>  /* KVM_IRQ_LINE irq field index values */
>>>> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
>>>> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
>>>> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
>>>> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>>>>  #define KVM_ARM_IRQ_VCPU_SHIFT        16
>>>> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
>>>> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>>>>  #define KVM_ARM_IRQ_NUM_SHIFT        0
>>>>  #define KVM_ARM_IRQ_NUM_MASK        0xffff
>>>>
>>>> ---8<---
>>>>
>>>> It makes things a bit better, it also immediately BREAKs the api with
>>>> old versions.
>>>>
>>>>
>>>> Next comes one more QEMU abort (with the "fix" above):
>>>>     "Failed to set device address: No space left on device"
>>>>
>>>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>>>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>>>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>>>> So we get a ENOSPC error here.
>>>
>>> Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
>>> I can do it.
>>
>> I really wonder whether that's a sensible thing to do on its own.
>>
>> Looking at the implementation of kvm_io_bus_register_dev (which copies
>> the whole array each time we insert a device), we have an obvious issue
>> with systems that create a large number of device structures, leading to
>> large transient memory usage and slow guest start.
>>
>> We could also try and reduce the number of devices we insert by making
>> the redistributor a single device (which it is in reality). It probably
>> means we need to make the MMIO decoding more flexible.
> 
> Yes it makes sense. If no objection, I can work on this as I am the
> source of the mess ;-)

Sure, if you have some spare bandwidth, feel free to give it a go. I'd
certainly like to see the userspace counterpart to my earlier patch, and
some agreement on the ABI change.

Thanks,

	M.
Zenghui Yu Aug. 23, 2019, 2:21 a.m. UTC | #7
Hi Eric,

On 2019/8/22 17:08, Auger Eric wrote:
> Hi Zenghui,
> 
> On 8/13/19 10:50 AM, Zenghui Yu wrote:
>> Hi folks,
>>
>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>> start it up with the latest QEMU.  I guess there are at least *two*
>> reasons (limitations).
>>
>> First I got a QEMU abort:
>>      "kvm_set_irq: Invalid argument"
>>
>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>> vcpu-256, I got:
>>      "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>
>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
>>
>> I temporarily patched the KVM and QEMU with the following diff:
>>
>> ---8<---
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 95516a4..39a0fb1 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>   #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
>>
>>   /* KVM_IRQ_LINE irq field index values */
>> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
>> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
>> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
>> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>>   #define KVM_ARM_IRQ_VCPU_SHIFT        16
>> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
>> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>>   #define KVM_ARM_IRQ_NUM_SHIFT        0
>>   #define KVM_ARM_IRQ_NUM_MASK        0xffff
>>
>> ---8<---
>>
>> It makes things a bit better, it also immediately BREAKs the api with
>> old versions.
>>
>>
>> Next comes one more QEMU abort (with the "fix" above):
>>      "Failed to set device address: No space left on device"
>>
>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>> So we get a ENOSPC error here.
> 
> Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
> I can do it.

[...]

> You mentioned you started working on the QEMU fix. Do you plan to submit
> patches to the ML or do you want me to attempt to fix the situation on
> userspace?
Actually, I haven't started anything yet.  It would be great if you can
help to fix this in QEMU.  Yes, please help. :)

And please cc me when you send the patch so that I can do some tests in
the free time.


Many thanks,
zenghui
Eric Auger Aug. 23, 2019, 7:37 a.m. UTC | #8
Hi Zenghui,
On 8/23/19 4:21 AM, Zenghui Yu wrote:
> Hi Eric,
> 
> On 2019/8/22 17:08, Auger Eric wrote:
>> Hi Zenghui,
>>
>> On 8/13/19 10:50 AM, Zenghui Yu wrote:
>>> Hi folks,
>>>
>>> Since commit e25028c8ded0 ("KVM: arm/arm64: Bump VGIC_V3_MAX_CPUS to
>>> 512"), we seemed to be allowed to boot a 512U guest.  But I failed to
>>> start it up with the latest QEMU.  I guess there are at least *two*
>>> reasons (limitations).
>>>
>>> First I got a QEMU abort:
>>>      "kvm_set_irq: Invalid argument"
>>>
>>> Enable the trace_kvm_irq_line() under debugfs, when it comed with
>>> vcpu-256, I got:
>>>      "Inject UNKNOWN interrupt (3), vcpu->idx: 0, num: 23, level: 0"
>>> and kvm_vm_ioctl_irq_line() returns -EINVAL to user-space...
>>>
>>> So the thing is that we only have 8 bits for vcpu_index field ([23:16])
>>> in KVM_IRQ_LINE ioctl.  irq_type field will be corrupted if we inject a
>>> PPI to vcpu-256, whose vcpu_index will take 9 bits.
>>>
>>> I temporarily patched the KVM and QEMU with the following diff:
>>>
>>> ---8<---
>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>>> b/arch/arm64/include/uapi/asm/kvm.h
>>> index 95516a4..39a0fb1 100644
>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>> @@ -325,10 +325,10 @@ struct kvm_vcpu_events {
>>>   #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER        1
>>>
>>>   /* KVM_IRQ_LINE irq field index values */
>>> -#define KVM_ARM_IRQ_TYPE_SHIFT        24
>>> -#define KVM_ARM_IRQ_TYPE_MASK        0xff
>>> +#define KVM_ARM_IRQ_TYPE_SHIFT        28
>>> +#define KVM_ARM_IRQ_TYPE_MASK        0xf
>>>   #define KVM_ARM_IRQ_VCPU_SHIFT        16
>>> -#define KVM_ARM_IRQ_VCPU_MASK        0xff
>>> +#define KVM_ARM_IRQ_VCPU_MASK        0xfff
>>>   #define KVM_ARM_IRQ_NUM_SHIFT        0
>>>   #define KVM_ARM_IRQ_NUM_MASK        0xffff
>>>
>>> ---8<---
>>>
>>> It makes things a bit better, it also immediately BREAKs the api with
>>> old versions.
>>>
>>>
>>> Next comes one more QEMU abort (with the "fix" above):
>>>      "Failed to set device address: No space left on device"
>>>
>>> We register two io devices (rd_dev and sgi_dev) on KVM_MMIO_BUS for
>>> each redistributor. 512 vcpus take 1024 io devices, which is beyond the
>>> maximum limitation of the current kernel - NR_IOBUS_DEVS (1000).
>>> So we get a ENOSPC error here.
>>
>> Do you plan to send a patch for increasing the NR_IOBUS_DEVS? Otherwise
>> I can do it.
> 
> [...]
> 
>> You mentioned you started working on the QEMU fix. Do you plan to submit
>> patches to the ML or do you want me to attempt to fix the situation on
>> userspace?
> Actually, I haven't started anything yet.  It would be great if you can
> help to fix this in QEMU.  Yes, please help. :)
> 
> And please cc me when you send the patch so that I can do some tests in
> the free time.
Sure, I will work on this.

Thanks

Eric
> 
> 
> Many thanks,
> zenghui
>
diff mbox series

Patch

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 95516a4..39a0fb1 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -325,10 +325,10 @@  struct kvm_vcpu_events {
  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1

  /* KVM_IRQ_LINE irq field index values */
-#define KVM_ARM_IRQ_TYPE_SHIFT		24
-#define KVM_ARM_IRQ_TYPE_MASK		0xff
+#define KVM_ARM_IRQ_TYPE_SHIFT		28
+#define KVM_ARM_IRQ_TYPE_MASK		0xf
  #define KVM_ARM_IRQ_VCPU_SHIFT		16
-#define KVM_ARM_IRQ_VCPU_MASK		0xff
+#define KVM_ARM_IRQ_VCPU_MASK		0xfff
  #define KVM_ARM_IRQ_NUM_SHIFT		0
  #define KVM_ARM_IRQ_NUM_MASK		0xffff