
[2/2] KVM: Create debugfs dir and stat files for each VM

Message ID 565C0B1F.7070606@de.ibm.com (mailing list archive)
State New, archived

Commit Message

Christian Borntraeger Nov. 30, 2015, 8:38 a.m. UTC
On 11/27/2015 09:42 PM, Tyler Baker wrote:
> On 27 November 2015 at 10:53, Tyler Baker <tyler.baker@linaro.org> wrote:
>> On 27 November 2015 at 09:08, Tyler Baker <tyler.baker@linaro.org> wrote:
>>> On 27 November 2015 at 00:54, Christian Borntraeger
>>> <borntraeger@de.ibm.com> wrote:
>>>> On 11/26/2015 09:47 PM, Christian Borntraeger wrote:
>>>>> On 11/26/2015 05:17 PM, Tyler Baker wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> The kernelci.org bot recently has been reporting kvm guest boot
>>>>>> failures[1] on various arm64 platforms in next-20151126. The bot
>>>>>> bisected[2] the failures to the commit in -next titled "KVM: Create
>>>>>> debugfs dir and stat files for each VM". I confirmed by reverting this
>>>>>> commit on top of next-20151126 it resolves the boot issue.
>>>>>>
>>>>>> In this test case the host and guest are booted with the same kernel.
>>>>>> The host is booted over nfs, installs qemu (qemu-system arm64 2.4.0),
>>>>>> and launches a guest. The host is booting fine, but when the guest is
>>>>>> launched it errors with "Failed to retrieve host CPU features!". I
>>>>>> checked the host logs, and found an "Unable to handle kernel paging
>>>>>> request" splat[3] which occurs when the guest is attempting to start.
>>>>>>
>>>>>> I scanned the patch in question but nothing obvious jumped out at me,
>>>>>> any thoughts?
>>>>>
>>>>> Not really.
>>>>> Do you have processes running that read the files in /sys/kernel/debug/kvm/*?
>>>>>
>>>>> If I read the ARM oops message correctly, it oopsed inside
>>>>> __srcu_read_lock. There is actually nothing in there that can oops,
>>>>> except the access to the preempt count. I am just guessing right now,
>>>>> but maybe the preempt variable is no longer available (as the process
>>>>> is gone). As long as a debugfs file is open, we hold a reference to
>>>>> the kvm, which holds a reference to the mm, so the mm might be killed
>>>>> after the process. But this is supposed to work, so maybe it's something
>>>>> different. An objdump of __srcu_read_lock might help.
>>>>
>>>> Hmm, the preempt thing is done in srcu_read_lock, but the crash is in
>>>> __srcu_read_lock. This function gets the srcu struct from mmu_notifier.c,
>>>> which must be present and is initialized during boot.
>>>>
>>>>
>>>> int __srcu_read_lock(struct srcu_struct *sp)
>>>> {
>>>>         int idx;
>>>>
>>>>         idx = READ_ONCE(sp->completed) & 0x1;
>>>>         __this_cpu_inc(sp->per_cpu_ref->c[idx]);
>>>>         smp_mb(); /* B */  /* Avoid leaking the critical section. */
>>>>         __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
>>>>         return idx;
>>>> }
>>>>
>>>> Looking at the code I have no clue why the patch makes a difference.
>>>> Can you try to get an objdump -S for __srcu_read_lock?
>>
>> Some other interesting finding below...
>>
>> On the host, I do _not_ have any nodes under /sys/kernel/debug/kvm/
>>
>> Running strace on the qemu command I use to launch the guest yields
>> the following.
>>
>> [pid  5963] 1448649724.405537 mmap(NULL, 65536, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6652a000
>> [pid  5963] 1448649724.405586 read(13, "MemTotal:       16414616
>> kB\nMemF"..., 1024) = 1024
>> [pid  5963] 1448649724.405699 close(13) = 0
>> [pid  5963] 1448649724.405755 munmap(0x7f6652a000, 65536) = 0
>> [pid  5963] 1448649724.405947 brk(0x2552f000) = 0x2552f000
>> [pid  5963] 1448649724.406148 openat(AT_FDCWD, "/dev/kvm",
>> O_RDWR|O_CLOEXEC) = 13
>> [pid  5963] 1448649724.406209 ioctl(13, KVM_CREATE_VM, 0) = -1 ENOMEM
>> (Cannot allocate memory)
> 
> If I comment the call to kvm_create_vm_debugfs(kvm) the guest boots
> fine. I put some printk's in the kvm_create_vm_debugfs() function and
> it's returning -ENOMEM after it evaluates !kvm->debugfs_dentry. I was
> chatting with some folks from the Linaro virtualization team and they
> mentioned that ARM is a bit special as the same PID creates two vms in
> quick succession, the first one is a scratch vm, and the other is the
> 'real' vm. With that bit of info, I suspect we may be trying to create
> the debugfs directory twice, and the second time it's failing because
> it already exists.

Hmmm, with a patched QEMU that calls KVM_CREATE_VM twice it errors out on s390
with -ENOMEM (which it should not), but it errors out gracefully.

Does the attached patch avoid the crash? (guest will not start, but qemu
should error out gracefully with ENOMEM)

Comments

Tyler Baker Nov. 30, 2015, 5 p.m. UTC | #1
On 30 November 2015 at 00:38, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 11/27/2015 09:42 PM, Tyler Baker wrote:
>> On 27 November 2015 at 10:53, Tyler Baker <tyler.baker@linaro.org> wrote:
>>> On 27 November 2015 at 09:08, Tyler Baker <tyler.baker@linaro.org> wrote:
>>>> On 27 November 2015 at 00:54, Christian Borntraeger
>>>> <borntraeger@de.ibm.com> wrote:
>>>>> On 11/26/2015 09:47 PM, Christian Borntraeger wrote:
>>>>>> On 11/26/2015 05:17 PM, Tyler Baker wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> The kernelci.org bot recently has been reporting kvm guest boot
>>>>>>> failures[1] on various arm64 platforms in next-20151126. The bot
>>>>>>> bisected[2] the failures to the commit in -next titled "KVM: Create
>>>>>>> debugfs dir and stat files for each VM". I confirmed by reverting this
>>>>>>> commit on top of next-20151126 it resolves the boot issue.
>>>>>>>
>>>>>>> In this test case the host and guest are booted with the same kernel.
>>>>>>> The host is booted over nfs, installs qemu (qemu-system arm64 2.4.0),
>>>>>>> and launches a guest. The host is booting fine, but when the guest is
>>>>>>> launched it errors with "Failed to retrieve host CPU features!". I
>>>>>>> checked the host logs, and found an "Unable to handle kernel paging
>>>>>>> request" splat[3] which occurs when the guest is attempting to start.
>>>>>>>
>>>>>>> I scanned the patch in question but nothing obvious jumped out at me,
>>>>>>> any thoughts?
>>>>>>
>>>>>> Not really.
>>>>>> Do you have processes running that read the files in /sys/kernel/debug/kvm/*?
>>>>>>
>>>>>> If I read the ARM oops message correctly, it oopsed inside
>>>>>> __srcu_read_lock. There is actually nothing in there that can oops,
>>>>>> except the access to the preempt count. I am just guessing right now,
>>>>>> but maybe the preempt variable is no longer available (as the process
>>>>>> is gone). As long as a debugfs file is open, we hold a reference to
>>>>>> the kvm, which holds a reference to the mm, so the mm might be killed
>>>>>> after the process. But this is supposed to work, so maybe it's something
>>>>>> different. An objdump of __srcu_read_lock might help.
>>>>>
>>>>> Hmm, the preempt thing is done in srcu_read_lock, but the crash is in
>>>>> __srcu_read_lock. This function gets the srcu struct from mmu_notifier.c,
>>>>> which must be present and is initialized during boot.
>>>>>
>>>>>
>>>>> int __srcu_read_lock(struct srcu_struct *sp)
>>>>> {
>>>>>         int idx;
>>>>>
>>>>>         idx = READ_ONCE(sp->completed) & 0x1;
>>>>>         __this_cpu_inc(sp->per_cpu_ref->c[idx]);
>>>>>         smp_mb(); /* B */  /* Avoid leaking the critical section. */
>>>>>         __this_cpu_inc(sp->per_cpu_ref->seq[idx]);
>>>>>         return idx;
>>>>> }
>>>>>
>>>>> Looking at the code I have no clue why the patch makes a difference.
>>>>> Can you try to get an objdump -S for __srcu_read_lock?
>>>
>>> Some other interesting finding below...
>>>
>>> On the host, I do _not_ have any nodes under /sys/kernel/debug/kvm/
>>>
>>> Running strace on the qemu command I use to launch the guest yields
>>> the following.
>>>
>>> [pid  5963] 1448649724.405537 mmap(NULL, 65536, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6652a000
>>> [pid  5963] 1448649724.405586 read(13, "MemTotal:       16414616
>>> kB\nMemF"..., 1024) = 1024
>>> [pid  5963] 1448649724.405699 close(13) = 0
>>> [pid  5963] 1448649724.405755 munmap(0x7f6652a000, 65536) = 0
>>> [pid  5963] 1448649724.405947 brk(0x2552f000) = 0x2552f000
>>> [pid  5963] 1448649724.406148 openat(AT_FDCWD, "/dev/kvm",
>>> O_RDWR|O_CLOEXEC) = 13
>>> [pid  5963] 1448649724.406209 ioctl(13, KVM_CREATE_VM, 0) = -1 ENOMEM
>>> (Cannot allocate memory)
>>
>> If I comment the call to kvm_create_vm_debugfs(kvm) the guest boots
>> fine. I put some printk's in the kvm_create_vm_debugfs() function and
>> it's returning -ENOMEM after it evaluates !kvm->debugfs_dentry. I was
>> chatting with some folks from the Linaro virtualization team and they
>> mentioned that ARM is a bit special as the same PID creates two vms in
>> quick succession, the first one is a scratch vm, and the other is the
>> 'real' vm. With that bit of info, I suspect we may be trying to create
>> the debugfs directory twice, and the second time it's failing because
>> it already exists.
>
> Hmmm, with a patched QEMU that calls KVM_CREATE_VM twice it errors out on s390
> with -ENOMEM (which it should not), but it errors out gracefully.
>
> Does the attached patch avoid the crash? (guest will not start, but qemu
> should error out gracefully with ENOMEM)

Yeah. I patched my host kernel and now the qemu guest launch errors out
gracefully[1].

Cheers,

Tyler

[1] http://hastebin.com/rotiropayo.mel
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Feb. 2, 2016, 5:15 p.m. UTC | #2
On 30/11/2015 09:38, Christian Borntraeger wrote:
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f7d6c8f..b26472a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -671,12 +671,16 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  
>  	r = kvm_create_vm_debugfs(kvm);
>  	if (r)
> -		goto out_err;
> +		goto out_mmu;
>  
>  	preempt_notifier_inc();
>  
>  	return kvm;
>  
> +out_mmu:
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
> +#endif
>  out_err:
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  out_err_no_irq_srcu:

Looking at old unread email in my inbox... can you resubmit this to
kvm@vger.kernel.org as a proper patch (Signed-off-by, commit message, etc.)?

Thanks,

Paolo
Christian Borntraeger Feb. 2, 2016, 7:37 p.m. UTC | #3
On 02/02/2016 06:15 PM, Paolo Bonzini wrote:
> 
> 
> On 30/11/2015 09:38, Christian Borntraeger wrote:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index f7d6c8f..b26472a 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -671,12 +671,16 @@ static struct kvm *kvm_create_vm(unsigned long type)
>>  
>>  	r = kvm_create_vm_debugfs(kvm);
>>  	if (r)
>> -		goto out_err;
>> +		goto out_mmu;
>>  
>>  	preempt_notifier_inc();
>>  
>>  	return kvm;
>>  
>> +out_mmu:
>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>> +	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
>> +#endif
>>  out_err:
>>  	cleanup_srcu_struct(&kvm->irq_srcu);
>>  out_err_no_irq_srcu:
> 
> Looking at old unread email in my inbox... can you resubmit this to
> kvm@vger.kernel.org as a proper patch (Signed-off-by, commit message, etc.)?
>

This patch was only necessary to fix up an earlier version of Janosch's patch;
upstream is fine.


Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7d6c8f..b26472a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -671,12 +671,16 @@  static struct kvm *kvm_create_vm(unsigned long type)
 
 	r = kvm_create_vm_debugfs(kvm);
 	if (r)
-		goto out_err;
+		goto out_mmu;
 
 	preempt_notifier_inc();
 
 	return kvm;
 
+out_mmu:
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+#endif
 out_err:
 	cleanup_srcu_struct(&kvm->irq_srcu);
 out_err_no_irq_srcu: