diff mbox series

[15/18] Boot Linux Console Test: add a test for aarch64 + virt

Message ID 20190117185628.21862-16-crosa@redhat.com (mailing list archive)
State New, archived
Headers show
Series Acceptance Tests: target architecture support | expand

Commit Message

Cleber Rosa Jan. 17, 2019, 6:56 p.m. UTC
Just like the previous tests, this boots a Linux kernel on an aarch64
target using the virt machine.

One special option added is the CPU type, given that the kernel
selected fails to boot on the virt machine's default CPU (cortex-a15).

Signed-off-by: Cleber Rosa <crosa@redhat.com>
---
 .travis.yml                            |  2 +-
 tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Comments

Caio Carrara Jan. 21, 2019, 8:32 p.m. UTC | #1
On Thu, Jan 17, 2019 at 01:56:25PM -0500, Cleber Rosa wrote:
> Just like the previous tests, boots a Linux kernel on a aarch64 target
> using the virt machine.
> 
> One special option added is the CPU type, given that the kernel
> selected fails to boot on the virt machine's default CPU (cortex-a15).
> 
> Signed-off-by: Cleber Rosa <crosa@redhat.com>

Reviewed-by: Caio Carrara <ccarrara@redhat.com>

> ---
>  .travis.yml                            |  2 +-
>  tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> (...)
>
Wainer dos Santos Moschetta Jan. 31, 2019, 8:02 p.m. UTC | #2
On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> Just like the previous tests, boots a Linux kernel on a aarch64 target
> using the virt machine.
>
> One special option added is the CPU type, given that the kernel
> selected fails to boot on the virt machine's default CPU (cortex-a15).
>
> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> ---
>   .travis.yml                            |  2 +-
>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
>   2 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/.travis.yml b/.travis.yml
> index 54100eea5a..595e8c0b6c 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -187,7 +187,7 @@ matrix:
>   
>       # Acceptance (Functional) tests
>       - env:
> -        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> +        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
>           - TEST_CMD="make check-acceptance"
>         addons:
>           apt:
> diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
> index f3ccd23a7a..107700b517 100644
> --- a/tests/acceptance/boot_linux_console.py
> +++ b/tests/acceptance/boot_linux_console.py
> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
>           self.vm.launch()
>           console_pattern = 'Kernel command line: %s' % kernel_command_line
>           self.wait_for_console_pattern(console_pattern)
> +
> +    def test_aarch64_virt(self):

That test case fails on my system (Fedora 29 x86_64). Avocado seems
unable to kill the VM, so it reaches the timeout.

I compiled QEMU with default configuration:

$ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu 
--target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu

Follows a snippet of the Avocado's job.log file:
----
2019-01-31 14:41:34,912 test             L0602 INFO | START 
07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA 
(filename=output.expected) => NOT FOUND (data sources: variant, test, file)
2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch, 
path=*, default=aarch64) => 'aarch64'
2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS 
(key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) => 
'aarch64-softmmu/qemu-system-aarch64'
2019-01-31 14:41:34,915 download         L0070 INFO | Fetching 
https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz 
-> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL 
"https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz": 
content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT", 
last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command: 
'aarch64-softmmu/qemu-system-aarch64 -chardev 
socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon 
chardev=mon,mode=control -display none -vga none -machine virt -chardev 
socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait 
-serial chardev:console -cpu cortex-a53 -kernel 
/var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute': 
'qmp_capabilities'}
2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000] 
Booting Linux on physical CPU 0x0000000000 [0x410fd034]

(...)

2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000] 
Policy zone: DMA32
2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000] 
Kernel command line: console=ttyAMA0
2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute': 
'quit'}
2019-01-31 14:44:35,636 qemu             L0357 WARNI| qemu received 
signal -9: aarch64-softmmu/qemu-system-aarch64 -chardev 
socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon 
chardev=mon,mode=control -display none -vga none -machine virt -chardev 
socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait 
-serial chardev:console -cpu cortex-a53 -kernel 
/var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0

(...)

2019-01-31 14:44:35,663 runner           L0253 ERROR| ERROR Test 
reported status but did not finish -> TestAbortError: 
07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt.
2019-01-31 14:44:35,664 runner           L0062 ERROR| Runner error 
occurred: Timeout reached
----

- Wainer

> +        """
> +        :avocado: tags=arch:aarch64
> +        :avocado: tags=machine:virt
> +        """
> +        kernel_url = ('https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/'
> +                      'releases/29/Server/aarch64/os/images/pxeboot/vmlinuz')
> +        kernel_hash = '8c73e469fc6ea06a58dc83a628fc695b693b8493'
> +        kernel_path = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
> +
> +        self.vm.set_machine('virt')
> +        self.vm.set_console()
> +        kernel_command_line = 'console=ttyAMA0'
> +        self.vm.add_args('-cpu', 'cortex-a53',
> +                         '-kernel', kernel_path,
> +                         '-append', kernel_command_line)
> +        self.vm.launch()
> +        console_pattern = 'Kernel command line: %s' % kernel_command_line
> +        self.wait_for_console_pattern(console_pattern)
Cleber Rosa Jan. 31, 2019, 8:21 p.m. UTC | #3
On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> 
> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
>> Just like the previous tests, boots a Linux kernel on a aarch64 target
>> using the virt machine.
>>
>> One special option added is the CPU type, given that the kernel
>> selected fails to boot on the virt machine's default CPU (cortex-a15).
>>
>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
>> ---
>>   .travis.yml                            |  2 +-
>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
>>   2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/.travis.yml b/.travis.yml
>> index 54100eea5a..595e8c0b6c 100644
>> --- a/.travis.yml
>> +++ b/.travis.yml
>> @@ -187,7 +187,7 @@ matrix:
>>         # Acceptance (Functional) tests
>>       - env:
>> -        - CONFIG="--python=/usr/bin/python3
>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
>> +        - CONFIG="--python=/usr/bin/python3
>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
>>
>>           - TEST_CMD="make check-acceptance"
>>         addons:
>>           apt:
>> diff --git a/tests/acceptance/boot_linux_console.py
>> b/tests/acceptance/boot_linux_console.py
>> index f3ccd23a7a..107700b517 100644
>> --- a/tests/acceptance/boot_linux_console.py
>> +++ b/tests/acceptance/boot_linux_console.py
>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
>>           self.vm.launch()
>>           console_pattern = 'Kernel command line: %s' %
>> kernel_command_line
>>           self.wait_for_console_pattern(console_pattern)
>> +
>> +    def test_aarch64_virt(self):
> 
> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> unable to kill the VM so it  reaches the timeout.
> 
> I compiled QEMU with default configuration:
> 
> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> 
> 
> Follows a snippet of the Avocado's job.log file:
> ----
> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> 
> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> path=*, default=aarch64) => 'aarch64'
> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> 'aarch64-softmmu/qemu-system-aarch64'
> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> -serial chardev:console -cpu cortex-a53 -kernel
> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> 'qmp_capabilities'}
> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> 
> (...)
> 
> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> Policy zone: DMA32
> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> Kernel command line: console=ttyAMA0
> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> 'quit'}

Here, a QMP response like "<<< {'return': {}}" would be expected.
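
For illustration, here is a minimal sketch (plain Python, not QEMU's actual QMP client code) of how a line of QMP server output can be classified, which is what makes the missing response above stand out — a healthy `quit` normally produces an empty `return` object:

```python
import json

def classify_qmp_message(raw):
    """Classify one line of QMP server output as a command response,
    an error, or an asynchronous event."""
    msg = json.loads(raw)
    if 'return' in msg:
        return 'return'   # successful command response, e.g. to 'quit'
    if 'error' in msg:
        return 'error'    # command failed
    if 'event' in msg:
        return 'event'    # async notification such as SHUTDOWN
    return 'unknown'

# A healthy 'quit' yields an empty 'return'; in the hung run above,
# no such line ever appears after the '>>> quit' is sent.
assert classify_qmp_message('{"return": {}}') == 'return'
assert classify_qmp_message('{"event": "SHUTDOWN", "data": {}}') == 'event'
```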

Since I can not reproduce this on my system (or on Travis-CI jobs I've
sent), can you tell me on top of which commit you've applied these patches?

Thanks!
- Cleber.

> 2019-01-31 14:44:35,636 qemu             L0357 WARNI| qemu received
> signal -9: aarch64-softmmu/qemu-system-aarch64 -chardev
> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> -serial chardev:console -cpu cortex-a53 -kernel
> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0
> 
> (...)
> 
> 2019-01-31 14:44:35,663 runner           L0253 ERROR| ERROR Test
> reported status but did not finish -> TestAbortError:
> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt.
> 
> 2019-01-31 14:44:35,664 runner           L0062 ERROR| Runner error
> occurred: Timeout reached
> ----
> 
> - Wainer
> 
>> +        """
>> +        :avocado: tags=arch:aarch64
>> +        :avocado: tags=machine:virt
>> +        """
>> +        kernel_url =
>> ('https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/'
>> +                     
>> 'releases/29/Server/aarch64/os/images/pxeboot/vmlinuz')
>> +        kernel_hash = '8c73e469fc6ea06a58dc83a628fc695b693b8493'
>> +        kernel_path = self.fetch_asset(kernel_url,
>> asset_hash=kernel_hash)
>> +
>> +        self.vm.set_machine('virt')
>> +        self.vm.set_console()
>> +        kernel_command_line = 'console=ttyAMA0'
>> +        self.vm.add_args('-cpu', 'cortex-a53',
>> +                         '-kernel', kernel_path,
>> +                         '-append', kernel_command_line)
>> +        self.vm.launch()
>> +        console_pattern = 'Kernel command line: %s' %
>> kernel_command_line
>> +        self.wait_for_console_pattern(console_pattern)
>
Cleber Rosa Jan. 31, 2019, 9:26 p.m. UTC | #4
On 1/31/19 3:21 PM, Cleber Rosa wrote:
> 
> 
> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
>>
>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
>>> using the virt machine.
>>>
>>> One special option added is the CPU type, given that the kernel
>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
>>>
>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
>>> ---
>>>   .travis.yml                            |  2 +-
>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
>>>   2 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/.travis.yml b/.travis.yml
>>> index 54100eea5a..595e8c0b6c 100644
>>> --- a/.travis.yml
>>> +++ b/.travis.yml
>>> @@ -187,7 +187,7 @@ matrix:
>>>         # Acceptance (Functional) tests
>>>       - env:
>>> -        - CONFIG="--python=/usr/bin/python3
>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
>>> +        - CONFIG="--python=/usr/bin/python3
>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
>>>
>>>           - TEST_CMD="make check-acceptance"
>>>         addons:
>>>           apt:
>>> diff --git a/tests/acceptance/boot_linux_console.py
>>> b/tests/acceptance/boot_linux_console.py
>>> index f3ccd23a7a..107700b517 100644
>>> --- a/tests/acceptance/boot_linux_console.py
>>> +++ b/tests/acceptance/boot_linux_console.py
>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
>>>           self.vm.launch()
>>>           console_pattern = 'Kernel command line: %s' %
>>> kernel_command_line
>>>           self.wait_for_console_pattern(console_pattern)
>>> +
>>> +    def test_aarch64_virt(self):
>>
>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
>> unable to kill the VM so it  reaches the timeout.
>>
>> I compiled QEMU with default configuration:
>>
>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
>>
>>
>> Follows a snippet of the Avocado's job.log file:
>> ----
>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
>>
>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
>> path=*, default=aarch64) => 'aarch64'
>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
>> 'aarch64-softmmu/qemu-system-aarch64'
>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
>> -serial chardev:console -cpu cortex-a53 -kernel
>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
>> 'qmp_capabilities'}
>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>
>> (...)
>>
>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
>> Policy zone: DMA32
>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
>> Kernel command line: console=ttyAMA0
>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
>> 'quit'}
> 
> Here, a QMP response like "<<< {'return': {}}" would be expected.
> 
> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> sent), can you tell me on top of which commit you've applied these patches?
> 

I spoke too soon:

https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033

This looks like a recent regression, and I'm guessing it's not on the
test's side.  I'll try to bisect it and let you know.

Thanks,
- Cleber.
Cleber Rosa Feb. 1, 2019, 4:10 p.m. UTC | #5
On 1/31/19 4:26 PM, Cleber Rosa wrote:
> 
> 
> On 1/31/19 3:21 PM, Cleber Rosa wrote:
>>
>>
>> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
>>>
>>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
>>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
>>>> using the virt machine.
>>>>
>>>> One special option added is the CPU type, given that the kernel
>>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
>>>>
>>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
>>>> ---
>>>>   .travis.yml                            |  2 +-
>>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
>>>>   2 files changed, 21 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/.travis.yml b/.travis.yml
>>>> index 54100eea5a..595e8c0b6c 100644
>>>> --- a/.travis.yml
>>>> +++ b/.travis.yml
>>>> @@ -187,7 +187,7 @@ matrix:
>>>>         # Acceptance (Functional) tests
>>>>       - env:
>>>> -        - CONFIG="--python=/usr/bin/python3
>>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
>>>> +        - CONFIG="--python=/usr/bin/python3
>>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
>>>>
>>>>           - TEST_CMD="make check-acceptance"
>>>>         addons:
>>>>           apt:
>>>> diff --git a/tests/acceptance/boot_linux_console.py
>>>> b/tests/acceptance/boot_linux_console.py
>>>> index f3ccd23a7a..107700b517 100644
>>>> --- a/tests/acceptance/boot_linux_console.py
>>>> +++ b/tests/acceptance/boot_linux_console.py
>>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
>>>>           self.vm.launch()
>>>>           console_pattern = 'Kernel command line: %s' %
>>>> kernel_command_line
>>>>           self.wait_for_console_pattern(console_pattern)
>>>> +
>>>> +    def test_aarch64_virt(self):
>>>
>>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
>>> unable to kill the VM so it  reaches the timeout.
>>>
>>> I compiled QEMU with default configuration:
>>>
>>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
>>>
>>>
>>> Follows a snippet of the Avocado's job.log file:
>>> ----
>>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
>>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
>>>
>>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
>>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
>>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
>>> path=*, default=aarch64) => 'aarch64'
>>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
>>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
>>> 'aarch64-softmmu/qemu-system-aarch64'
>>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
>>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
>>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
>>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
>>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
>>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
>>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
>>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
>>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
>>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
>>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
>>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
>>> -serial chardev:console -cpu cortex-a53 -kernel
>>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
>>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
>>> 'qmp_capabilities'}
>>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
>>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
>>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>>
>>> (...)
>>>
>>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
>>> Policy zone: DMA32
>>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
>>> Kernel command line: console=ttyAMA0
>>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
>>> 'quit'}
>>
>> Here, a QMP response like "<<< {'return': {}}" would be expected.
>>
>> Since I can not reproduce this on my system (or on Travis-CI jobs I've
>> sent), can you tell me on top of which commit you've applied these patches?
>>
> 
> I spoke too soon:
> 
> https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> 
> This looks like a recent regression, and I'm guessing it's not on the
> test's side.  I'll try to bisect it and let you know.
> 

On a fresh environment, I am able to reproduce this on roughly every
other run.  When I hit it, I attached GDB to the process, and the
backtrace shows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
__lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
103     2:      movl    %edx, %eax
(gdb) bt
#0  __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
(mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
../nptl/pthread_mutex_lock.c:80
#2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
<qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
at util/qemu-thread-posix.c:66
#3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
(file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
at /home/cleber/src/qemu/cpus.c:1849
#4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
out>) at util/main-loop.c:236
#5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#6  0x00005615a18fdd39 in main_loop () at vl.c:1928
#7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4665

Running it with `taskset -c 1` prevents this issue from happening,
which, AFAICT, further suggests this is a QEMU race condition.
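
(As an aside, the same single-CPU pinning can be done from Python on Linux, without wrapping the command in `taskset` — a sketch using `os.sched_setaffinity`, pinning to CPU 0 here; child processes spawned after the call, such as a QEMU launched by a test, inherit the mask:)

```python
import os

# Save the original affinity so it can be restored afterwards.
original = os.sched_getaffinity(0)

# Pin the current process to a single CPU (CPU 0 here).  Children
# forked/exec'ed after this point inherit the mask, so a QEMU binary
# launched by the test would effectively run under `taskset -c 0`.
os.sched_setaffinity(0, {0})
pinned = os.sched_getaffinity(0)

# Restore the original mask so the rest of the job is unaffected.
os.sched_setaffinity(0, original)
```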

I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
seems to be limited to that target.  Any tips on what to do here?

Thanks,
- Cleber.

> Thanks,
> - Cleber.
>
Eduardo Habkost June 7, 2019, 3:26 a.m. UTC | #6
On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> 
> 
> On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > 
> > 
> > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> >>
> >>
> >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> >>>
> >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> >>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
> >>>> using the virt machine.
> >>>>
> >>>> One special option added is the CPU type, given that the kernel
> >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> >>>>
> >>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> >>>> ---
> >>>>   .travis.yml                            |  2 +-
> >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/.travis.yml b/.travis.yml
> >>>> index 54100eea5a..595e8c0b6c 100644
> >>>> --- a/.travis.yml
> >>>> +++ b/.travis.yml
> >>>> @@ -187,7 +187,7 @@ matrix:
> >>>>         # Acceptance (Functional) tests
> >>>>       - env:
> >>>> -        - CONFIG="--python=/usr/bin/python3
> >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> >>>> +        - CONFIG="--python=/usr/bin/python3
> >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> >>>>
> >>>>           - TEST_CMD="make check-acceptance"
> >>>>         addons:
> >>>>           apt:
> >>>> diff --git a/tests/acceptance/boot_linux_console.py
> >>>> b/tests/acceptance/boot_linux_console.py
> >>>> index f3ccd23a7a..107700b517 100644
> >>>> --- a/tests/acceptance/boot_linux_console.py
> >>>> +++ b/tests/acceptance/boot_linux_console.py
> >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> >>>>           self.vm.launch()
> >>>>           console_pattern = 'Kernel command line: %s' %
> >>>> kernel_command_line
> >>>>           self.wait_for_console_pattern(console_pattern)
> >>>> +
> >>>> +    def test_aarch64_virt(self):
> >>>
> >>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> >>> unable to kill the VM so it  reaches the timeout.
> >>>
> >>> I compiled QEMU with default configuration:
> >>>
> >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> >>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> >>>
> >>>
> >>> Follows a snippet of the Avocado's job.log file:
> >>> ----
> >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> >>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> >>>
> >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> >>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> >>> path=*, default=aarch64) => 'aarch64'
> >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> >>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> >>> 'aarch64-softmmu/qemu-system-aarch64'
> >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> >>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> >>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> >>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> >>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> >>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> >>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> >>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> >>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> >>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> >>> -serial chardev:console -cpu cortex-a53 -kernel
> >>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> >>> 'qmp_capabilities'}
> >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> >>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> >>>
> >>> (...)
> >>>
> >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> >>> Policy zone: DMA32
> >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> >>> Kernel command line: console=ttyAMA0
> >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> >>> 'quit'}
> >>
> >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> >>
> >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> >> sent), can you tell me on top of which commit you've applied these patches?
> >>
> > 
> > I spoke too soon:
> > 
> > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > 
> > This looks like a recent regression, and I'm guessing it's not on the
> > test's side.  I'll try to bisect it and let you know.
> > 
> 
> On a fresh environment, I am able to get this reproduced on every 2 of
> runs, more or less.  When I hit it, I attached GDB to it, and the
> backtrace shows:
> 
> Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> warning: Loadable section ".note.gnu.property" outside of ELF segments
> __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> 103     2:      movl    %edx, %eax
> (gdb) bt
> #0  __lll_lock_wait () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
> (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
> ../nptl/pthread_mutex_lock.c:80
> #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
> <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
> at util/qemu-thread-posix.c:66
> #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
> (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
> at /home/cleber/src/qemu/cpus.c:1849
> #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
> out>) at util/main-loop.c:236
> #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
> out>, envp=<optimized out>) at vl.c:4665

Tip: run "thread apply all bt" so you can get a backtrace of all
threads.
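
(A rough Python analogue of that gdb tip, for when the stuck process is a Python one such as the Avocado runner itself — `sys._current_frames()` gives one frame per live thread, which `traceback` can format; this is illustrative, not part of the test code:)

```python
import sys
import threading
import traceback

def dump_all_stacks():
    """Rough Python analogue of gdb's `thread apply all bt`:
    format a backtrace for every live thread in this process."""
    out = []
    for thread_id, frame in sys._current_frames().items():
        out.append('Thread 0x%x:\n' % thread_id)
        out.extend(traceback.format_stack(frame))
    return ''.join(out)

# Start a second thread so the dump has more than one entry.
started = threading.Event()
done = threading.Event()

def worker():
    started.set()
    done.wait()   # block here so the thread shows up in the dump

t = threading.Thread(target=worker)
t.start()
started.wait()
report = dump_all_stacks()
done.set()
t.join()
```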


> 
> > Running it with `taskset -c 1` prevents this issue from happening, which,
> > AFAICT, further suggests that this is a QEMU race condition.
> 
> > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > seems to be limited to that target.  Any tips on what to do here?

I am hitting this on Travis, too, and I finally managed to reproduce
it locally.

The guest is still writing on the serial console, but nobody is
reading the data on the other side.  A VCPU thread is stuck
inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
holding the QEMU global lock.

Thread 4 (Thread 0x7f2e45fff700 (LWP 6461)):
#0  0x00007f2e4ec03500 in nanosleep () at /lib64/libpthread.so.0
#1  0x00007f2e4fb229d7 in g_usleep () at /lib64/libglib-2.0.so.0
#2  0x0000559a4e7ca4c9 in qemu_chr_write_buffer (s=s@entry=0x559a502d0ac0, buf=buf@entry=0x7f2e45ffdd90 "7", len=1, offset=offset@entry=0x7f2e45ffdd60, write_all=true) at chardev/char.c:115
#3  0x0000559a4e7ca78f in qemu_chr_write (s=0x559a502d0ac0, buf=buf@entry=0x7f2e45ffdd90 "7", len=len@entry=1, write_all=write_all@entry=true) at chardev/char.c:148
#4  0x0000559a4e7cc7e2 in qemu_chr_fe_write_all (be=be@entry=0x559a504b4c50, buf=buf@entry=0x7f2e45ffdd90 "7", len=len@entry=1) at chardev/char-fe.c:53
#5  0x0000559a4e58f320 in pl011_write (opaque=0x559a504b47d0, offset=0, value=55, size=<optimized out>) at hw/char/pl011.c:183
#6  0x0000559a4e325121 in memory_region_write_accessor (mr=0x559a504b4ae0, addr=0, value=<optimized out>, size=2, shift=<optimized out>, mask=<optimized out>, attrs=...) at /home/ehabkost/rh/proj/virt/qemu/memory.c:503
#7  0x0000559a4e322cd6 in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f2e45ffded8, size=size@entry=2, access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=access_fn@entry=
    0x559a4e3250a0 <memory_region_write_accessor>, mr=0x559a504b4ae0, attrs=...) at /home/ehabkost/rh/proj/virt/qemu/memory.c:569
#8  0x0000559a4e32763f in memory_region_dispatch_write (mr=mr@entry=0x559a504b4ae0, addr=addr@entry=0, data=<optimized out>, data@entry=55, size=size@entry=2, attrs=...) at /home/ehabkost/rh/proj/virt/qemu/memory.c:1497
#9  0x0000559a4e338708 in io_writex (env=env@entry=0x559a503d5620, mmu_idx=mmu_idx@entry=1, val=val@entry=55, addr=addr@entry=18446462598867529728, retaddr=139836732143069, size=2, iotlbentry=<optimized out>, iotlbentry=<optimized out>)
    at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cputlb.c:945
#10 0x0000559a4e33d203 in store_helper (big_endian=false, size=2, retaddr=<optimized out>, oi=<optimized out>, val=55, addr=18446462598867529728, env=0x559a503d5620) at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cputlb.c:1544
#11 0x0000559a4e33d203 in helper_le_stw_mmu (env=0x559a503d5620, addr=18446462598867529728, val=55, oi=<optimized out>, retaddr=139836732143069) at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cputlb.c:1636
#12 0x00007f2e46bef5dd in code_gen_buffer ()
#13 0x0000559a4e352381 in cpu_tb_exec (itb=<optimized out>, cpu=0x559a503d5620) at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cpu-exec.c:171
#14 0x0000559a4e352381 in cpu_loop_exec_tb (tb_exit=<synthetic pointer>, last_tb=<synthetic pointer>, tb=<optimized out>, cpu=0x559a503d5620) at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cpu-exec.c:618
#15 0x0000559a4e352381 in cpu_exec (cpu=cpu@entry=0x559a503cd360) at /home/ehabkost/rh/proj/virt/qemu/accel/tcg/cpu-exec.c:729
#16 0x0000559a4e30ea0f in tcg_cpu_exec (cpu=0x559a503cd360) at /home/ehabkost/rh/proj/virt/qemu/cpus.c:1434
#17 0x0000559a4e310b6b in qemu_tcg_cpu_thread_fn (arg=arg@entry=0x559a503cd360) at /home/ehabkost/rh/proj/virt/qemu/cpus.c:1743
#18 0x0000559a4e83669a in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:502
#19 0x00007f2e4ebf958e in start_thread () at /lib64/libpthread.so.0
#20 0x00007f2e4eb266f3 in clone () at /lib64/libc.so.6


For reference, this is the QEMU command line:

aarch64-softmmu/qemu-system-aarch64 -chardev socket,id=mon,path=/var/tmp/tmpxnkcjvf0/qemu-6453-monitor.sock -mon chardev=mon,mode=control -display none -vga none -machine virt -chardev socket,id=console,path=/var/tmp/tmpxnkcjvf0/qemu-6453-console.sock,server,nowait -serial chardev:console -cpu cortex-a53 -kernel /home/ehabkost/rh/proj/virt/qemu/tests/venv/data/cache/by_location/e959d0e1dd72e77653e218e666198db1f3d0c213/vmlinuz -append printk.time=0 console=ttyAMA0
Eduardo Habkost June 7, 2019, 3:42 a.m. UTC | #7
On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> > 
> > 
> > On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > > 
> > > 
> > > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> > >>
> > >>
> > >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> > >>>
> > >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> > >>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
> > >>>> using the virt machine.
> > >>>>
> > >>>> One special option added is the CPU type, given that the kernel
> > >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> > >>>>
> > >>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> > >>>> ---
> > >>>>   .travis.yml                            |  2 +-
> > >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> > >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> > >>>>
> > >>>> diff --git a/.travis.yml b/.travis.yml
> > >>>> index 54100eea5a..595e8c0b6c 100644
> > >>>> --- a/.travis.yml
> > >>>> +++ b/.travis.yml
> > >>>> @@ -187,7 +187,7 @@ matrix:
> > >>>>         # Acceptance (Functional) tests
> > >>>>       - env:
> > >>>> -        - CONFIG="--python=/usr/bin/python3
> > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> > >>>> +        - CONFIG="--python=/usr/bin/python3
> > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> > >>>>
> > >>>>           - TEST_CMD="make check-acceptance"
> > >>>>         addons:
> > >>>>           apt:
> > >>>> diff --git a/tests/acceptance/boot_linux_console.py
> > >>>> b/tests/acceptance/boot_linux_console.py
> > >>>> index f3ccd23a7a..107700b517 100644
> > >>>> --- a/tests/acceptance/boot_linux_console.py
> > >>>> +++ b/tests/acceptance/boot_linux_console.py
> > >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> > >>>>           self.vm.launch()
> > >>>>           console_pattern = 'Kernel command line: %s' %
> > >>>> kernel_command_line
> > >>>>           self.wait_for_console_pattern(console_pattern)
> > >>>> +
> > >>>> +    def test_aarch64_virt(self):
> > >>>
> > >>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> > >>> unable to kill the VM so it  reaches the timeout.
> > >>>
> > >>> I compiled QEMU with default configuration:
> > >>>
> > >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> > >>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> > >>>
> > >>>
> > >>> Follows a snippet of the Avocado's job.log file:
> > >>> ----
> > >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> > >>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> > >>>
> > >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> > >>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> > >>> path=*, default=aarch64) => 'aarch64'
> > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> > >>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> > >>> 'aarch64-softmmu/qemu-system-aarch64'
> > >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> > >>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> > >>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> > >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> > >>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> > >>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> > >>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> > >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> > >>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> > >>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> > >>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> > >>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> > >>> -serial chardev:console -cpu cortex-a53 -kernel
> > >>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> > >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> > >>> 'qmp_capabilities'}
> > >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> > >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> > >>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > >>>
> > >>> (...)
> > >>>
> > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > >>> Policy zone: DMA32
> > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > >>> Kernel command line: console=ttyAMA0
> > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> > >>> 'quit'}
> > >>
> > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > >>
> > >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> > >> sent), can you tell me on top of which commit you've applied these patches?
> > >>
> > > 
{...}
> 
> Tip: run "thread apply all bt" so you can get a backtrace of all
> threads.
> 
> 
> > 
> > Running it with `taskset -c 1` prevents this issue from happening, which,
> > AFAICT, further suggests that this is a QEMU race condition.
> > 
> > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > seems to be limited to that target.  Any tips on what to do here?
> 
> I am hitting this on Travis, too, and I finally managed to reproduce
> it locally.
> 
> The guest is still writing on the serial console, but nobody is
> reading the data on the other side.  A VCPU thread is stuck
> inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> holding the QEMU global lock.

Experimental fix below.

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
---
 python/qemu/__init__.py | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
index 81d9657ec0..4a691f34da 100644
--- a/python/qemu/__init__.py
+++ b/python/qemu/__init__.py
@@ -274,10 +274,6 @@ class QEMUMachine(object):
 
         self._qemu_log_path = None
 
-        if self._console_socket is not None:
-            self._console_socket.close()
-            self._console_socket = None
-
         if self._temp_dir is not None:
             shutil.rmtree(self._temp_dir)
             self._temp_dir = None
@@ -336,6 +332,14 @@ class QEMUMachine(object):
         """
         Terminate the VM and clean up
         """
+
+        # If we keep the console socket open, we may deadlock waiting
+        # for QEMU to exit, while QEMU is waiting for the socket to
+        # become writeable.
+        if self._console_socket is not None:
+            self._console_socket.close()
+            self._console_socket = None
+
         if self.is_running():
             try:
                 self._qmp.cmd('quit')
Laszlo Ersek June 7, 2019, 7:41 a.m. UTC | #8
On 06/07/19 05:26, Eduardo Habkost wrote:

> Thread 4 (Thread 0x7f2e45fff700 (LWP 6461)):
> #0  0x00007f2e4ec03500 in nanosleep () at /lib64/libpthread.so.0
> #1  0x00007f2e4fb229d7 in g_usleep () at /lib64/libglib-2.0.so.0
> #2  0x0000559a4e7ca4c9 in qemu_chr_write_buffer (s=s@entry=0x559a502d0ac0, buf=buf@entry=0x7f2e45ffdd90 "7", len=1, offset=offset@entry=0x7f2e45ffdd60, write_all=true) at chardev/char.c:115
> #3  0x0000559a4e7ca78f in qemu_chr_write (s=0x559a502d0ac0, buf=buf@entry=0x7f2e45ffdd90 "7", len=len@entry=1, write_all=write_all@entry=true) at chardev/char.c:148
> #4  0x0000559a4e7cc7e2 in qemu_chr_fe_write_all (be=be@entry=0x559a504b4c50, buf=buf@entry=0x7f2e45ffdd90 "7", len=len@entry=1) at chardev/char-fe.c:53
> #5  0x0000559a4e58f320 in pl011_write (opaque=0x559a504b47d0, offset=0, value=55, size=<optimized out>) at hw/char/pl011.c:183
{...}

See also <https://bugzilla.redhat.com/show_bug.cgi?id=1661940>.

$ git show 6ab3fc32ea64 -- hw/char/pl011.c

Thanks
Laszlo
Cleber Rosa June 7, 2019, 3:33 p.m. UTC | #9
On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
{...}
> 
> Tip: run "thread apply all bt" so you can get a backtrace of all
> threads.
> 
>

Nice, thanks!

> > 
> > Running it with `taskset -c 1` prevents this issue from happening, which,
> > AFAICT, further suggests that this is a QEMU race condition.
> > 
> > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > seems to be limited to that target.  Any tips on what to do here?
> 
> I am hitting this on Travis, too, and I finally managed to reproduce
> it locally.
> 
> The guest is still writing on the serial console, but nobody is
> reading the data on the other side.  A VCPU thread is stuck
> inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> holding the QEMU global lock.
>

This is awesome, because it really looks like a perfect explanation
for the behavior/reproducer/workaround that I described here:

 https://bugs.launchpad.net/qemu/+bug/1829779

{...}
> 
> 
> For reference, this is the QEMU command line:
> 
> aarch64-softmmu/qemu-system-aarch64 -chardev socket,id=mon,path=/var/tmp/tmpxnkcjvf0/qemu-6453-monitor.sock -mon chardev=mon,mode=control -display none -vga none -machine virt -chardev socket,id=console,path=/var/tmp/tmpxnkcjvf0/qemu-6453-console.sock,server,nowait -serial chardev:console -cpu cortex-a53 -kernel /home/ehabkost/rh/proj/virt/qemu/tests/venv/data/cache/by_location/e959d0e1dd72e77653e218e666198db1f3d0c213/vmlinuz -append printk.time=0 console=ttyAMA0
> 
> 
> 
> -- 
> Eduardo

Besides the command line, I hope it's also easy to use the following
reproducer:

 https://github.com/clebergnu/qemu/commit/c778e28c24030c4a36548b714293b319f4bf18df

It should apply cleanly to QEMU master, and instructions on how to run
it are documented in the commit message.

Thanks,
- Cleber.
Cleber Rosa June 7, 2019, 3:44 p.m. UTC | #10
On Fri, Jun 07, 2019 at 12:42:14AM -0300, Eduardo Habkost wrote:
> On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> > On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
{...}
> > > >>>
> > > >>> (...)
> > > >>>
> > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > >>> Policy zone: DMA32
> > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > >>> Kernel command line: console=ttyAMA0
> > > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> > > >>> 'quit'}
> > > >>
> > > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > > >>
> > > >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> > > >> sent), can you tell me on top of which commit you've applied these patches?
> > > >>
> > > > 
> > > > I spoke too soon:
> > > > 
> > > > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > > > 
> > > > This looks like a recent regression, and I'm guessing it's not on the
> > > > test's side.  I'll try to bisect it and let you know.
> > > > 
> > > 
> > > On a fresh environment, I am able to reproduce this roughly every
> > > other run.  When I hit it, I attached GDB to it, and the
> > > backtrace shows:
> > > 
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > 103     2:      movl    %edx, %eax
> > > (gdb) bt
> > > #0  __lll_lock_wait () at
> > > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
> > > (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
> > > ../nptl/pthread_mutex_lock.c:80
> > > #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
> > > <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
> > > at util/qemu-thread-posix.c:66
> > > #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
> > > (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
> > > at /home/cleber/src/qemu/cpus.c:1849
> > > #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
> > > out>) at util/main-loop.c:236
> > > #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> > > #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> > > #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
> > > out>, envp=<optimized out>) at vl.c:4665
> > 
> > Tip: run "thread apply all bt" so you can get a backtrace of all
> > threads.
> > 
> > 
> > > 
> > > Running it with `taskset -c 1` prevents this issue from happening, which
> > > AFAICT, contributes even further towards this being a QEMU race condition.
> > > 
> > > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > > seems to be limited to that target.  Any tips on what to do here?
> > 
> > I am hitting this on Travis, too, and I finally could reproduce
> > it locally.
> > 
> > The guest is still writing on the serial console, but nobody is
> > reading the data on the other side.  A VCPU thread is stuck
> > inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> > holding the QEMU global lock.
> 
> Experimental fix below.
> 
> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
> ---
>  python/qemu/__init__.py | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
> index 81d9657ec0..4a691f34da 100644
> --- a/python/qemu/__init__.py
> +++ b/python/qemu/__init__.py
> @@ -274,10 +274,6 @@ class QEMUMachine(object):
>  
>          self._qemu_log_path = None
>  
> -        if self._console_socket is not None:
> -            self._console_socket.close()
> -            self._console_socket = None
> -
>          if self._temp_dir is not None:
>              shutil.rmtree(self._temp_dir)
>              self._temp_dir = None
> @@ -336,6 +332,14 @@ class QEMUMachine(object):
>          """
>          Terminate the VM and clean up
>          """
> +
> +        # If we keep the console socket open, we may deadlock waiting
> +        # for QEMU to exit, while QEMU is waiting for the socket to
> +        # become writeable.
> +        if self._console_socket is not None:
> +            self._console_socket.close()
> +            self._console_socket = None
> +

Right, this is somewhat equivalent to the following "workaround":

   https://github.com/clebergnu/qemu/commit/e1713f3b91972ad57c089f276c54db3f3fa63423

I could not tell at the moment, though, if closing (or shutting down)
the console socket was the appropriate fix.

What I see is that Rick's commit, pointed to by Laszlo, dates from
2016, and it says that a virtio-console fix is necessary.  I'd imagine
that, if a fix was ever proposed, it'd have been incorporated in the
F29 kernel used in this test... and that this workaround/fix would
not be the best solution.
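
For what it's worth, the effect of the workaround can be modeled outside
of QEMU entirely.  This is a toy sketch (plain Python sockets, nothing
QEMU-specific; the socketpair merely stands in for the console chardev
connection) of why closing the reader end turns an indefinite
would-block condition on the writer into an immediate error it can act
on:

```python
import socket

# A connected pair standing in for the console chardev: `writer` plays
# the guest/serial side, `reader` plays the test harness side.
writer, reader = socket.socketpair()
writer.setblocking(False)

# Fill the kernel socket buffer until a write would block -- roughly the
# state the stuck vCPU thread is in (EAGAIN, retry, EAGAIN, ...).
sent = 0
while True:
    try:
        sent += writer.send(b"x" * 4096)
    except BlockingIOError:
        break

# Closing the reader end (what the workaround does during cleanup) turns
# the endless retry into an immediate, detectable failure.
reader.close()
try:
    writer.send(b"x")
    unblocked = False
except (BrokenPipeError, ConnectionResetError):
    unblocked = True
print(unblocked)  # expected: True -- the writer is no longer wedged
```

Again, this is only an analogy for the blocking behavior, not a claim
about QEMU's internals.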

But I'm mostly attempting to navigate this using my senses, with little
hard data and even less background :).

Regards,
- Cleber.

>          if self.is_running():
>              try:
>                  self._qmp.cmd('quit')
> -- 
> 2.18.0.rc1.1.g3f1ff2140
> 
> -- 
> Eduardo
Eduardo Habkost June 7, 2019, 6:58 p.m. UTC | #11
CCing Daniel, who wrote commit 6ab3fc32ea64.

On Fri, Jun 07, 2019 at 11:44:32AM -0400, Cleber Rosa wrote:
> On Fri, Jun 07, 2019 at 12:42:14AM -0300, Eduardo Habkost wrote:
> > On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> > > On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> > > > 
> > > > 
> > > > On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > > > > 
> > > > > 
> > > > > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> > > > >>
> > > > >>
> > > > >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> > > > >>>
> > > > >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> > > > >>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
> > > > >>>> using the virt machine.
> > > > >>>>
> > > > >>>> One special option added is the CPU type, given that the kernel
> > > > >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> > > > >>>>
> > > > >>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> > > > >>>> ---
> > > > >>>>   .travis.yml                            |  2 +-
> > > > >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> > > > >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> > > > >>>>
> > > > >>>> diff --git a/.travis.yml b/.travis.yml
> > > > >>>> index 54100eea5a..595e8c0b6c 100644
> > > > >>>> --- a/.travis.yml
> > > > >>>> +++ b/.travis.yml
> > > > >>>> @@ -187,7 +187,7 @@ matrix:
> > > > >>>>         # Acceptance (Functional) tests
> > > > >>>>       - env:
> > > > >>>> -        - CONFIG="--python=/usr/bin/python3
> > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> > > > >>>> +        - CONFIG="--python=/usr/bin/python3
> > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> > > > >>>>
> > > > >>>>           - TEST_CMD="make check-acceptance"
> > > > >>>>         addons:
> > > > >>>>           apt:
> > > > >>>> diff --git a/tests/acceptance/boot_linux_console.py
> > > > >>>> b/tests/acceptance/boot_linux_console.py
> > > > >>>> index f3ccd23a7a..107700b517 100644
> > > > >>>> --- a/tests/acceptance/boot_linux_console.py
> > > > >>>> +++ b/tests/acceptance/boot_linux_console.py
> > > > >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> > > > >>>>           self.vm.launch()
> > > > >>>>           console_pattern = 'Kernel command line: %s' %
> > > > >>>> kernel_command_line
> > > > >>>>           self.wait_for_console_pattern(console_pattern)
> > > > >>>> +
> > > > >>>> +    def test_aarch64_virt(self):
> > > > >>>
> > > > >>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> > > > >>> unable to kill the VM so it  reaches the timeout.
> > > > >>>
> > > > >>> I compiled QEMU with default configuration:
> > > > >>>
> > > > >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> > > > >>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> > > > >>>
> > > > >>>
> > > > >>> Follows a snippet of the Avocado's job.log file:
> > > > >>> ----
> > > > >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> > > > >>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> > > > >>>
> > > > >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> > > > >>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> > > > >>> path=*, default=aarch64) => 'aarch64'
> > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> > > > >>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> > > > >>> 'aarch64-softmmu/qemu-system-aarch64'
> > > > >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> > > > >>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> > > > >>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> > > > >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> > > > >>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> > > > >>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> > > > >>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> > > > >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> > > > >>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> > > > >>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> > > > >>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> > > > >>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> > > > >>> -serial chardev:console -cpu cortex-a53 -kernel
> > > > >>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> > > > >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> > > > >>> 'qmp_capabilities'}
> > > > >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> > > > >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > >>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > > >>>
> > > > >>> (...)
> > > > >>>
> > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > >>> Policy zone: DMA32
> > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > >>> Kernel command line: console=ttyAMA0
> > > > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> > > > >>> 'quit'}
> > > > >>
> > > > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > > > >>
> > > > >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> > > > >> sent), can you tell me on top of which commit you've applied these patches?
> > > > >>
> > > > > 
> > > > > I spoke too soon:
> > > > > 
> > > > > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > > > > 
> > > > > This looks like a recent regression, and I'm guessing it's not on the
> > > > > test's side.  I'll try to bisect it and let you know.
> > > > > 
> > > > 
> > > > On a fresh environment, I am able to reproduce this roughly every
> > > > other run.  When I hit it, I attached GDB to it, and the
> > > > backtrace shows:
> > > > 
> > > > [Thread debugging using libthread_db enabled]
> > > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > 103     2:      movl    %edx, %eax
> > > > (gdb) bt
> > > > #0  __lll_lock_wait () at
> > > > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
> > > > (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
> > > > ../nptl/pthread_mutex_lock.c:80
> > > > #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
> > > > <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
> > > > at util/qemu-thread-posix.c:66
> > > > #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
> > > > (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
> > > > at /home/cleber/src/qemu/cpus.c:1849
> > > > #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
> > > > out>) at util/main-loop.c:236
> > > > #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> > > > #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> > > > #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
> > > > out>, envp=<optimized out>) at vl.c:4665
> > > 
> > > Tip: run "thread apply all bt" so you can get a backtrace of all
> > > threads.
> > > 
> > > 
> > > > 
> > > > Running it with `taskset -c 1` prevents this issue from happening, which
> > > > AFAICT, contributes even further towards this being a QEMU race condition.
> > > > 
> > > > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > > > seems to be limited to that target.  Any tips on what to do here?
> > > 
> > > I am hitting this on Travis, too, and I finally could reproduce
> > > it locally.
> > > 
> > > The guest is still writing on the serial console, but nobody is
> > > reading the data on the other side.  A VCPU thread is stuck
> > > inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> > > holding the QEMU global lock.
> > 
> > Experimental fix below.
> > 
> > Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
> > ---
> >  python/qemu/__init__.py | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
> > index 81d9657ec0..4a691f34da 100644
> > --- a/python/qemu/__init__.py
> > +++ b/python/qemu/__init__.py
> > @@ -274,10 +274,6 @@ class QEMUMachine(object):
> >  
> >          self._qemu_log_path = None
> >  
> > -        if self._console_socket is not None:
> > -            self._console_socket.close()
> > -            self._console_socket = None
> > -
> >          if self._temp_dir is not None:
> >              shutil.rmtree(self._temp_dir)
> >              self._temp_dir = None
> > @@ -336,6 +332,14 @@ class QEMUMachine(object):
> >          """
> >          Terminate the VM and clean up
> >          """
> > +
> > +        # If we keep the console socket open, we may deadlock waiting
> > +        # for QEMU to exit, while QEMU is waiting for the socket to
> > +        # become writeable.
> > +        if self._console_socket is not None:
> > +            self._console_socket.close()
> > +            self._console_socket = None
> > +
> 
> Right, this is somewhat equivalent to the following "workaround":
> 
>    https://github.com/clebergnu/qemu/commit/e1713f3b91972ad57c089f276c54db3f3fa63423
> 
> I could not tell at the moment, though, if closing (or shutting down)
> the console socket was the appropriate fix.
> 
> What I see is that Rick's commit, pointed to by Laszlo, dates from
> 2016, and it says that a virtio-console fix is necessary.  I'd imagine
> that, if a fix was ever proposed, it'd have been incorporated in the
> F29 kernel used in this test... and that this workaround/fix would
> not be the best solution.

If I understood this correctly, fixing the guest driver wouldn't
help here.  The commit mentioned by Laszlo (6ab3fc32ea64 "hw:
replace most use of qemu_chr_fe_write with qemu_chr_fe_write_all")
fixes data loss bugs but introduces the potential for deadlocks
like the one we are seeing.

Unless we replace the existing ~30 qemu_chr_fe_write_all() calls
in device code, we need to make sure we are constantly reading
data from the console socket.
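
Concretely, the "constantly reading" requirement can be met on the test
side with a dedicated drain thread.  A minimal sketch (plain Python, not
the actual python/qemu API; the socketpair stands in for the console
chardev connection):

```python
import socket
import threading

def drain(sock):
    """Keep consuming console output so the guest side never stalls."""
    while True:
        data = sock.recv(4096)
        if not data:          # peer closed -- nothing left to drain
            return

# The pair stands in for the console chardev connection.
harness_end, guest_end = socket.socketpair()
threading.Thread(target=drain, args=(harness_end,), daemon=True).start()

# With a reader always present, the guest side can emit far more than
# one socket buffer's worth of data without ever blocking.
for _ in range(1000):
    guest_end.sendall(b"x" * 4096)   # ~4 MiB; would wedge with no reader
guest_end.close()
completed = True
```

Anything that keeps reading (a thread, a poll loop, a logger) would do;
the point is only that the reader must outlive the guest's writes.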

> 
> But I'm mostly attempting to navigate this using my senses, with little
> hard data and even less background :).
> 
> Regards,
> - Cleber.
> 
> >          if self.is_running():
> >              try:
> >                  self._qmp.cmd('quit')
> > -- 
> > 2.18.0.rc1.1.g3f1ff2140
> > 
> > -- 
> > Eduardo
Daniel P. Berrangé June 10, 2019, 8:53 a.m. UTC | #12
On Fri, Jun 07, 2019 at 03:58:57PM -0300, Eduardo Habkost wrote:
> CCing Daniel, who wrote commit 6ab3fc32ea64.
> 
> On Fri, Jun 07, 2019 at 11:44:32AM -0400, Cleber Rosa wrote:
> > On Fri, Jun 07, 2019 at 12:42:14AM -0300, Eduardo Habkost wrote:
> > > On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> > > > On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> > > > > 
> > > > > 
> > > > > On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > > > > > 
> > > > > > 
> > > > > > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> > > > > >>
> > > > > >>
> > > > > >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> > > > > >>>
> > > > > >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> > > > > >>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
> > > > > >>>> using the virt machine.
> > > > > >>>>
> > > > > >>>> One special option added is the CPU type, given that the kernel
> > > > > >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> > > > > >>>>
> > > > > >>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> > > > > >>>> ---
> > > > > >>>>   .travis.yml                            |  2 +-
> > > > > >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> > > > > >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> > > > > >>>>
> > > > > >>>> diff --git a/.travis.yml b/.travis.yml
> > > > > >>>> index 54100eea5a..595e8c0b6c 100644
> > > > > >>>> --- a/.travis.yml
> > > > > >>>> +++ b/.travis.yml
> > > > > >>>> @@ -187,7 +187,7 @@ matrix:
> > > > > >>>>         # Acceptance (Functional) tests
> > > > > >>>>       - env:
> > > > > >>>> -        - CONFIG="--python=/usr/bin/python3
> > > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> > > > > >>>> +        - CONFIG="--python=/usr/bin/python3
> > > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> > > > > >>>>
> > > > > >>>>           - TEST_CMD="make check-acceptance"
> > > > > >>>>         addons:
> > > > > >>>>           apt:
> > > > > >>>> diff --git a/tests/acceptance/boot_linux_console.py
> > > > > >>>> b/tests/acceptance/boot_linux_console.py
> > > > > >>>> index f3ccd23a7a..107700b517 100644
> > > > > >>>> --- a/tests/acceptance/boot_linux_console.py
> > > > > >>>> +++ b/tests/acceptance/boot_linux_console.py
> > > > > >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> > > > > >>>>           self.vm.launch()
> > > > > >>>>           console_pattern = 'Kernel command line: %s' %
> > > > > >>>> kernel_command_line
> > > > > >>>>           self.wait_for_console_pattern(console_pattern)
> > > > > >>>> +
> > > > > >>>> +    def test_aarch64_virt(self):
> > > > > >>>
> > > > > >>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> > > > > >>> unable to kill the VM so it  reaches the timeout.
> > > > > >>>
> > > > > >>> I compiled QEMU with default configuration:
> > > > > >>>
> > > > > >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> > > > > >>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> > > > > >>>
> > > > > >>>
> > > > > >>> Follows a snippet of the Avocado's job.log file:
> > > > > >>> ----
> > > > > >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> > > > > >>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> > > > > >>>
> > > > > >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> > > > > >>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> > > > > >>> path=*, default=aarch64) => 'aarch64'
> > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> > > > > >>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> > > > > >>> 'aarch64-softmmu/qemu-system-aarch64'
> > > > > >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> > > > > >>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> > > > > >>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> > > > > >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> > > > > >>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> > > > > >>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> > > > > >>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> > > > > >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> > > > > >>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> > > > > >>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> > > > > >>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> > > > > >>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> > > > > >>> -serial chardev:console -cpu cortex-a53 -kernel
> > > > > >>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> > > > > >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> > > > > >>> 'qmp_capabilities'}
> > > > > >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> > > > > >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > >>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > > > >>>
> > > > > >>> (...)
> > > > > >>>
> > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > >>> Policy zone: DMA32
> > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > >>> Kernel command line: console=ttyAMA0
> > > > > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> > > > > >>> 'quit'}
> > > > > >>
> > > > > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > > > > >>
> > > > > >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> > > > > >> sent), can you tell me on top of which commit you've applied these patches?
> > > > > >>
> > > > > > 
> > > > > > I spoke too soon:
> > > > > > 
> > > > > > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > > > > > 
> > > > > > This looks like a recent regression, and I'm guessing it's not on the
> > > > > > test's side.  I'll try to bisect it and let you know.
> > > > > > 
> > > > > 
> > > > > On a fresh environment, I am able to reproduce this roughly every
> > > > > other run.  When I hit it, I attached GDB to it, and the
> > > > > backtrace shows:
> > > > > 
> > > > > [Thread debugging using libthread_db enabled]
> > > > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > 103     2:      movl    %edx, %eax
> > > > > (gdb) bt
> > > > > #0  __lll_lock_wait () at
> > > > > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
> > > > > (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
> > > > > ../nptl/pthread_mutex_lock.c:80
> > > > > #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
> > > > > <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
> > > > > at util/qemu-thread-posix.c:66
> > > > > #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
> > > > > (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
> > > > > at /home/cleber/src/qemu/cpus.c:1849
> > > > > #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
> > > > > out>) at util/main-loop.c:236
> > > > > #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> > > > > #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> > > > > #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
> > > > > out>, envp=<optimized out>) at vl.c:4665
> > > > 
> > > > Tip: run "thread apply all bt" so you can get a backtrace of all
> > > > threads.
> > > > 
> > > > 
> > > > > 
> > > > > Running it with `taskset -c 1` prevents this issue from happening, which
> > > > > AFAICT, contributes even further towards this being a QEMU race condition.
> > > > > 
> > > > > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > > > > seems to be limited to that target.  Any tips on what to do here?
> > > > 
> > > > I am hitting this on Travis, too, and I finally could reproduce
> > > > it locally.
> > > > 
> > > > The guest is still writing on the serial console, but nobody is
> > > > reading the data on the other side.  A VCPU thread is stuck
> > > > inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> > > > holding the QEMU global lock.
> > > 
> > > Experimental fix below.
> > > 
> > > Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
> > > ---
> > >  python/qemu/__init__.py | 12 ++++++++----
> > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
> > > index 81d9657ec0..4a691f34da 100644
> > > --- a/python/qemu/__init__.py
> > > +++ b/python/qemu/__init__.py
> > > @@ -274,10 +274,6 @@ class QEMUMachine(object):
> > >  
> > >          self._qemu_log_path = None
> > >  
> > > -        if self._console_socket is not None:
> > > -            self._console_socket.close()
> > > -            self._console_socket = None
> > > -
> > >          if self._temp_dir is not None:
> > >              shutil.rmtree(self._temp_dir)
> > >              self._temp_dir = None
> > > @@ -336,6 +332,14 @@ class QEMUMachine(object):
> > >          """
> > >          Terminate the VM and clean up
> > >          """
> > > +
> > > +        # If we keep the console socket open, we may deadlock waiting
> > > +        # for QEMU to exit, while QEMU is waiting for the socket to
> > > +        # become writeable.
> > > +        if self._console_socket is not None:
> > > +            self._console_socket.close()
> > > +            self._console_socket = None
> > > +
> > 
> > Right, this is somewhat equivalent to the following "workaround":
> > 
> >    https://github.com/clebergnu/qemu/commit/e1713f3b91972ad57c089f276c54db3f3fa63423
> > 
> > I could not tell at the moment, though, if closing (or shutting down)
> > the console socket was the appropriate fix.
> > 
> > What I see is that Rick's commit, pointed to by Laszlo, dates from
> > 2016, and it says that a virtio-console fix is necessary.  I'd imagine
> > that, if a fix was ever proposed, it'd have been incorporated in the
> > F29 kernel used in this test... and that this workaround/fix would
> > not be the best solution.
> 
> If I understood this correctly, fixing the guest driver wouldn't
> help here.  The commit mentioned by Laszlo (6ab3fc32ea64 "hw:
> replace most use of qemu_chr_fe_write with qemu_chr_fe_write_all")
> fixes data loss bugs but introduces the potential for deadlocks
> like the one we are seeing.
> 
> Unless we replace the existing ~30 qemu_chr_fe_write_all() calls
> in device code, we need to make sure we are constantly reading
> data from the console socket.

Yes, you must *always* have something reading from the chardev backend.
This really sucks, but it is preferable to letting data end up in
/dev/null.
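
The data-loss side of that trade-off is easy to model too.  A toy
sketch (plain Python sockets, nothing QEMU-specific) of best-effort
writes -- the pre-6ab3fc32ea64 qemu_chr_fe_write() behavior -- dropping
whatever the kernel buffer cannot hold while nobody is reading:

```python
import socket

writer, reader = socket.socketpair()
writer.setblocking(False)
# Shrink the send buffer so the overflow is easy to trigger.
writer.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)

attempted = dropped = 0
for _ in range(100):
    chunk = b"x" * 4096
    attempted += len(chunk)
    try:
        n = writer.send(chunk)       # best-effort: partial writes allowed
        dropped += len(chunk) - n
    except BlockingIOError:          # buffer full: the data just vanishes
        dropped += len(chunk)

# With no reader on the other end, a large fraction of the ~400 KiB
# attempted never arrives -- the silent loss write-all semantics fixed,
# at the cost of the blocking behavior seen in this thread.
```

So neither extreme is free: drop-on-full loses guest output, write-all
can wedge the vCPU thread when the consumer goes away.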

If this is a problem for tests, then consider it motivation to fix
the root-cause problem and make the devices properly do async I/O
for chardevs.  Only a handful of them do this properly right now.


Regards,
Daniel
Cleber Rosa June 10, 2019, 4:50 p.m. UTC | #13
On Mon, Jun 10, 2019 at 09:53:03AM +0100, Daniel P. Berrangé wrote:
> On Fri, Jun 07, 2019 at 03:58:57PM -0300, Eduardo Habkost wrote:
> > CCing Daniel, who wrote commit 6ab3fc32ea64.
> > 
> > On Fri, Jun 07, 2019 at 11:44:32AM -0400, Cleber Rosa wrote:
> > > On Fri, Jun 07, 2019 at 12:42:14AM -0300, Eduardo Habkost wrote:
> > > > On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> > > > > On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> > > > > > 
> > > > > > 
> > > > > > On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> > > > > > >>>
> > > > > > >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> > > > > > >>>> Just like the previous tests, boots a Linux kernel on a aarch64 target
> > > > > > >>>> using the virt machine.
> > > > > > >>>>
> > > > > > >>>> One special option added is the CPU type, given that the kernel
> > > > > > >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> > > > > > >>>>
> > > > > > >>>> Signed-off-by: Cleber Rosa <crosa@redhat.com>
> > > > > > >>>> ---
> > > > > > >>>>   .travis.yml                            |  2 +-
> > > > > > >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> > > > > > >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> > > > > > >>>>
> > > > > > >>>> diff --git a/.travis.yml b/.travis.yml
> > > > > > >>>> index 54100eea5a..595e8c0b6c 100644
> > > > > > >>>> --- a/.travis.yml
> > > > > > >>>> +++ b/.travis.yml
> > > > > > >>>> @@ -187,7 +187,7 @@ matrix:
> > > > > > >>>>         # Acceptance (Functional) tests
> > > > > > >>>>       - env:
> > > > > > >>>> -        - CONFIG="--python=/usr/bin/python3
> > > > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> > > > > > >>>> +        - CONFIG="--python=/usr/bin/python3
> > > > > > >>>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> > > > > > >>>>
> > > > > > >>>>           - TEST_CMD="make check-acceptance"
> > > > > > >>>>         addons:
> > > > > > >>>>           apt:
> > > > > > >>>> diff --git a/tests/acceptance/boot_linux_console.py
> > > > > > >>>> b/tests/acceptance/boot_linux_console.py
> > > > > > >>>> index f3ccd23a7a..107700b517 100644
> > > > > > >>>> --- a/tests/acceptance/boot_linux_console.py
> > > > > > >>>> +++ b/tests/acceptance/boot_linux_console.py
> > > > > > >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> > > > > > >>>>           self.vm.launch()
> > > > > > >>>>           console_pattern = 'Kernel command line: %s' %
> > > > > > >>>> kernel_command_line
> > > > > > >>>>           self.wait_for_console_pattern(console_pattern)
> > > > > > >>>> +
> > > > > > >>>> +    def test_aarch64_virt(self):
> > > > > > >>>
> > > > > > >>> That test case fails on my system (Fedora 29 x86_64). Avocado seems
> > > > > > >>> unable to kill the VM, so it reaches the timeout.
> > > > > > >>>
> > > > > > >>> I compiled QEMU with default configuration:
> > > > > > >>>
> > > > > > >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu
> > > > > > >>> --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu)
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Follows a snippet of the Avocado's job.log file:
> > > > > > >>> ----
> > > > > > >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START
> > > > > > >>> 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> > > > > > >>>
> > > > > > >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA
> > > > > > >>> (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> > > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch,
> > > > > > >>> path=*, default=aarch64) => 'aarch64'
> > > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS
> > > > > > >>> (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) =>
> > > > > > >>> 'aarch64-softmmu/qemu-system-aarch64'
> > > > > > >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching
> > > > > > >>> https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz
> > > > > > >>> -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> > > > > > >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL
> > > > > > >>> "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz":
> > > > > > >>> content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT",
> > > > > > >>> last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> > > > > > >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command:
> > > > > > >>> 'aarch64-softmmu/qemu-system-aarch64 -chardev
> > > > > > >>> socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon
> > > > > > >>> chardev=mon,mode=control -display none -vga none -machine virt -chardev
> > > > > > >>> socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait
> > > > > > >>> -serial chardev:console -cpu cortex-a53 -kernel
> > > > > > >>> /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> > > > > > >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute':
> > > > > > >>> 'qmp_capabilities'}
> > > > > > >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> > > > > > >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > > >>> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > > > > >>>
> > > > > > >>> (...)
> > > > > > >>>
> > > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > > >>> Policy zone: DMA32
> > > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000]
> > > > > > >>> Kernel command line: console=ttyAMA0
> > > > > > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute':
> > > > > > >>> 'quit'}
> > > > > > >>
> > > > > > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > > > > > >>
> > > > > > >> Since I can not reproduce this on my system (or on Travis-CI jobs I've
> > > > > > >> sent), can you tell me on top of which commit you've applied these patches?
> > > > > > >>
> > > > > > > 
> > > > > > > I spoke too soon:
> > > > > > > 
> > > > > > > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > > > > > > 
> > > > > > > This looks like a recent regression, and I'm guessing it's not on the
> > > > > > > test's side.  I'll try to bisect it and let you know.
> > > > > > > 
> > > > > > 
> > > > > > On a fresh environment, I am able to get this reproduced on every 2 of
> > > > > > runs, more or less.  When I hit it, I attached GDB to it, and the
> > > > > > backtrace shows:
> > > > > > 
> > > > > > [Thread debugging using libthread_db enabled]
> > > > > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > > __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > > 103     2:      movl    %edx, %eax
> > > > > > (gdb) bt
> > > > > > #0  __lll_lock_wait () at
> > > > > > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > > #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock
> > > > > > (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at
> > > > > > ../nptl/pthread_mutex_lock.c:80
> > > > > > #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020
> > > > > > <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236)
> > > > > > at util/qemu-thread-posix.c:66
> > > > > > #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl
> > > > > > (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236)
> > > > > > at /home/cleber/src/qemu/cpus.c:1849
> > > > > > #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized
> > > > > > out>) at util/main-loop.c:236
> > > > > > #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> > > > > > #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> > > > > > #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized
> > > > > > out>, envp=<optimized out>) at vl.c:4665
> > > > > 
> > > > > Tip: run "thread apply all bt" so you can get a backtrace of all
> > > > > threads.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Running it with `taskset -c 1` prevents this issue from happening, which
> > > > > > AFAICT, contributes even further towards this being a QEMU race condition.
> > > > > > 
> > > > > > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > > > > > seems to limited to that target.  Any tips on what to do here?
> > > > > 
> > > > > I am hitting this on Travis, too, and I finally could reproduce
> > > > > it locally,
> > > > > 
> > > > > The guest is still writing on the serial console, but nobody is
> > > > > reading the data on the other side.  A VCPU thread is stuck
> > > > > inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> > > > > holding the QEMU global lock.
> > > > 
> > > > Experimental fix below.
> > > > 
> > > > Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
> > > > ---
> > > >  python/qemu/__init__.py | 12 ++++++++----
> > > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
> > > > index 81d9657ec0..4a691f34da 100644
> > > > --- a/python/qemu/__init__.py
> > > > +++ b/python/qemu/__init__.py
> > > > @@ -274,10 +274,6 @@ class QEMUMachine(object):
> > > >  
> > > >          self._qemu_log_path = None
> > > >  
> > > > -        if self._console_socket is not None:
> > > > -            self._console_socket.close()
> > > > -            self._console_socket = None
> > > > -
> > > >          if self._temp_dir is not None:
> > > >              shutil.rmtree(self._temp_dir)
> > > >              self._temp_dir = None
> > > > @@ -336,6 +332,14 @@ class QEMUMachine(object):
> > > >          """
> > > >          Terminate the VM and clean up
> > > >          """
> > > > +
> > > > +        # If we keep the console socket open, we may deadlock waiting
> > > > +        # for QEMU to exit, while QEMU is waiting for the socket to
> > > > +        # become writeable.
> > > > +        if self._console_socket is not None:
> > > > +            self._console_socket.close()
> > > > +            self._console_socket = None
> > > > +
> > > 
> > > Right, this is somewhat equivalent to the following "workaround":
> > > 
> > >    https://github.com/clebergnu/qemu/commit/e1713f3b91972ad57c089f276c54db3f3fa63423
> > > 
> > > I could not tell at the moment, though, if closing (or shutting down)
> > > the console socket was the appropriate fix.
> > > 
> > > What I see is that Rick's commit pointed to by Laszlo is dated from
> > > 2016, and it says that a virtio-console fix is necessary.  I'd imagine
> > > that, if a fix was ever proposed, it'd have been incorporated in the
> > > F29 kernel used in this test... and that this workaround/fix would
> > > not be the best solution.
> > 
> > If I understood this correctly, fixing the guest driver wouldn't
> > help here.  The commit mentioned by Laszlo (6ab3fc32ea64 "hw:
> > replace most use of qemu_chr_fe_write with qemu_chr_fe_write_all")
> > fixes data loss bugs but introduces the potential for deadlocks
> > like the one we are seeing.
> > 
> > Unless we replace the existing ~30 qemu_chr_fe_write_all() calls
> > in device code, we need to make sure we are constantly reading
> > data from the console socket.
> 
> Yes, you must *always* have something reading from the chardev backend.
> This really sucks, but it is preferable to letting data end up in
> /dev/null.
> 
> If this is a problem for tests then consider it motivation to
> fix the root cause problem and make the devices properly do async
> I/O for chardevs.  Only a handful of them properly do this right
> now.
>
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Daniel,

Thanks for the explanation.  I don't feel completely capable of tackling
the root cause, but I'll give it a go anyway.

Eduardo,

Looks like we'll need this type of workaround now anyway.  As seen in
the reproducer I wrote, a test can cause this deadlock before it gets
to the test tearDown() and, either implicitly or explicitly, to
QEMUMachine::shutdown().

I would also like to avoid exposing the user (test writer) to that
kind of situation, though, which seems to be what you tried to do.  But
maybe the most honest approach right now is to make an explicit call to
read from and close the socket, paired with a documentation comment about
why that is being done?  This would be similar to the "XXX" comments on
6ab3fc32ea64.

Regards!
- Cleber.
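[Editor's note] Cleber's suggestion — an explicit, documented drain-and-close of the console socket — might look roughly like the sketch below. The helper name and structure are hypothetical; this is not the code that was eventually merged into `python/qemu`, just an illustration of the workaround being discussed.

```python
import socket

def drain_and_close_console(console_socket, timeout=3.0):
    """Read and discard any data still buffered in the console
    socket, then close it.

    XXX: mirrors the situation introduced by commit 6ab3fc32ea64:
    devices write with qemu_chr_fe_write_all(), so QEMU can block
    (while holding the global lock) until someone reads the other
    end.  Draining before close avoids deadlocking a later 'quit'.
    """
    if console_socket is None:
        return
    console_socket.settimeout(timeout)
    try:
        while console_socket.recv(4096):
            pass                      # discard leftover console output
    except (socket.timeout, OSError):
        pass                          # nothing more to read, or peer gone
    finally:
        console_socket.close()
```

Calling such a helper at the top of shutdown(), before issuing the QMP 'quit', would make the workaround and its rationale visible in one place rather than scattered across tests.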
diff mbox series

Patch

diff --git a/.travis.yml b/.travis.yml
index 54100eea5a..595e8c0b6c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -187,7 +187,7 @@  matrix:
 
     # Acceptance (Functional) tests
     - env:
-        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
+        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
         - TEST_CMD="make check-acceptance"
       addons:
         apt:
diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
index f3ccd23a7a..107700b517 100644
--- a/tests/acceptance/boot_linux_console.py
+++ b/tests/acceptance/boot_linux_console.py
@@ -138,3 +138,23 @@  class BootLinuxConsole(Test):
         self.vm.launch()
         console_pattern = 'Kernel command line: %s' % kernel_command_line
         self.wait_for_console_pattern(console_pattern)
+
+    def test_aarch64_virt(self):
+        """
+        :avocado: tags=arch:aarch64
+        :avocado: tags=machine:virt
+        """
+        kernel_url = ('https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/'
+                      'releases/29/Server/aarch64/os/images/pxeboot/vmlinuz')
+        kernel_hash = '8c73e469fc6ea06a58dc83a628fc695b693b8493'
+        kernel_path = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
+
+        self.vm.set_machine('virt')
+        self.vm.set_console()
+        kernel_command_line = 'console=ttyAMA0'
+        self.vm.add_args('-cpu', 'cortex-a53',
+                         '-kernel', kernel_path,
+                         '-append', kernel_command_line)
+        self.vm.launch()
+        console_pattern = 'Kernel command line: %s' % kernel_command_line
+        self.wait_for_console_pattern(console_pattern)
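[Editor's note] The test relies on wait_for_console_pattern(), defined earlier in boot_linux_console.py (outside this hunk). Conceptually it consumes console output line by line until the expected pattern appears; the sketch below is a simplified, hypothetical version of that idea (the real helper also logs each line and fails the Avocado test on mismatch or timeout).

```python
def wait_for_pattern(line_iter, pattern):
    """Consume console lines until one contains 'pattern'.

    'line_iter' is any iterable of decoded console lines, e.g. a
    file object wrapping the console socket.  Returns the matching
    line; raises if the console closes before the pattern shows up.
    """
    for line in line_iter:
        if pattern in line:
            return line
    raise RuntimeError('console closed before %r was seen' % pattern)

# Example with canned console output:
boot_log = [
    '[    0.000000] Booting Linux on physical CPU 0x0\n',
    '[    0.000000] Kernel command line: console=ttyAMA0\n',
]
print(wait_for_pattern(iter(boot_log), 'Kernel command line: console=ttyAMA0'))
```

Note that a reader built this way is also exactly the "something reading from the chardev backend" that the deadlock discussion above calls for: as long as the test keeps iterating, QEMU's serial writes cannot back up.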