diff mbox series

[v4,4/4] meson: Warn when TCI is selected but TCG backend is available

Message ID 20210125144530.2837481-5-philmd@redhat.com (mailing list archive)
State New, archived
Headers show
Series meson: Try to clarify TCG / TCI options for new users | expand

Commit Message

Philippe Mathieu-Daudé Jan. 25, 2021, 2:45 p.m. UTC
Some new users get confused with 'TCG' and 'TCI', and enable TCI
support expecting to enable TCG.

Emit a warning when native TCG backend is available on the
host architecture, mentioning this is a suboptimal configuration.

Reviewed-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Suggested-by: Daniel Berrangé <berrange@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
 meson.build | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Daniel P. Berrangé Jan. 25, 2021, 4:46 p.m. UTC | #1
On Mon, Jan 25, 2021 at 03:45:30PM +0100, Philippe Mathieu-Daudé wrote:
> Some new users get confused with 'TCG' and 'TCI', and enable TCI
> support expecting to enable TCG.
> 
> Emit a warning when native TCG backend is available on the
> host architecture, mentioning this is a suboptimal configuration.
> 
> Reviewed-by: Stefan Weil <sw@weilnetz.de>
> Reviewed-by: Thomas Huth <thuth@redhat.com>
> Suggested-by: Daniel Berrangé <berrange@redhat.com>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> ---
>  meson.build | 7 +++++++
>  1 file changed, 7 insertions(+)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


Regards,
Daniel
Daniel P. Berrangé Jan. 25, 2021, 4:47 p.m. UTC | #2
On Mon, Jan 25, 2021 at 03:45:30PM +0100, Philippe Mathieu-Daudé wrote:
> Some new users get confused with 'TCG' and 'TCI', and enable TCI
> support expecting to enable TCG.
> 
> Emit a warning when native TCG backend is available on the
> host architecture, mentioning this is a suboptimal configuration.
> 
> Reviewed-by: Stefan Weil <sw@weilnetz.de>
> Reviewed-by: Thomas Huth <thuth@redhat.com>

Nitpick, the text printed is completely rewritten from what they
reviewed, so I would probably have dropped their R-b for that
scenario.

> Suggested-by: Daniel Berrangé <berrange@redhat.com>
> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> ---
>  meson.build | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/meson.build b/meson.build
> index 16b2560e7e7..f675c54e636 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -228,6 +228,13 @@
>      else
>        error('Unsupported CPU @0@, try --enable-tcg-interpreter'.format(cpu))
>      endif
> +  elif get_option('tcg_interpreter')
> +    warning('Use of the TCG interpretor is not recommended on this host')
> +    warning('architecture. There is a native TCG execution backend available')
> +    warning('which provides substantially better performance and reliability.')
> +    warning('It is strongly recommended to remove the --enable-tcg-interpreter')
> +    warning('configuration option on this architecture to use the native')
> +    warning('backend.')
>    endif
>    if get_option('tcg_interpreter')
>      tcg_arch = 'tci'
> -- 
> 2.26.2
> 

Regards,
Daniel
Philippe Mathieu-Daudé Jan. 25, 2021, 5:05 p.m. UTC | #3
On 1/25/21 5:47 PM, Daniel P. Berrangé wrote:
> On Mon, Jan 25, 2021 at 03:45:30PM +0100, Philippe Mathieu-Daudé wrote:
>> Some new users get confused with 'TCG' and 'TCI', and enable TCI
>> support expecting to enable TCG.
>>
>> Emit a warning when native TCG backend is available on the
>> host architecture, mentioning this is a suboptimal configuration.
>>
>> Reviewed-by: Stefan Weil <sw@weilnetz.de>
>> Reviewed-by: Thomas Huth <thuth@redhat.com>
> 
> Nitpick, the text printed is completely rewritten from what they
> reviewed, so I would probably have dropped their R-b for that
> scenario.

I thought about it, and assumed their review tag was for the logical
change of adding a warning, not particularly the warning content.

I agree this it would have been better to ask them to review again.
Next time I will reset the tags.

>> Suggested-by: Daniel Berrangé <berrange@redhat.com>
>> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
>> ---
>>  meson.build | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/meson.build b/meson.build
>> index 16b2560e7e7..f675c54e636 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -228,6 +228,13 @@
>>      else
>>        error('Unsupported CPU @0@, try --enable-tcg-interpreter'.format(cpu))
>>      endif
>> +  elif get_option('tcg_interpreter')
>> +    warning('Use of the TCG interpretor is not recommended on this host')
>> +    warning('architecture. There is a native TCG execution backend available')
>> +    warning('which provides substantially better performance and reliability.')
>> +    warning('It is strongly recommended to remove the --enable-tcg-interpreter')
>> +    warning('configuration option on this architecture to use the native')
>> +    warning('backend.')
>>    endif
>>    if get_option('tcg_interpreter')
>>      tcg_arch = 'tci'
>> -- 
>> 2.26.2
>>
> 
> Regards,
> Daniel
>
Stefan Weil Jan. 25, 2021, 6:58 p.m. UTC | #4
Am 25.01.21 um 18:05 schrieb Philippe Mathieu-Daudé:

> On 1/25/21 5:47 PM, Daniel P. Berrangé wrote:
>> On Mon, Jan 25, 2021 at 03:45:30PM +0100, Philippe Mathieu-Daudé wrote:
>>> Some new users get confused with 'TCG' and 'TCI', and enable TCI
>>> support expecting to enable TCG.
>>>
>>> Emit a warning when native TCG backend is available on the
>>> host architecture, mentioning this is a suboptimal configuration.
>>>
>>> Reviewed-by: Stefan Weil <sw@weilnetz.de>
>>> Reviewed-by: Thomas Huth <thuth@redhat.com>
>> Nitpick, the text printed is completely rewritten from what they
>> reviewed, so I would probably have dropped their R-b for that
>> scenario.
> I thought about it, and assumed their review tag was for the logical
> change of adding a warning, not particularly the warning content.
>
> I agree this it would have been better to ask them to review again.
> Next time I will reset the tags.


You are right, I would not have signed that new text and either used the 
original text (which was sufficient in my opinion) or used a different one:

Use of the TCG interpretor is not recommended on this host architecture for most users because there is a native TCG execution backend available which provides substantially better performance.

I have no evidence that TCI is less reliable than TCG, so I would not 
write that.

And there are people who have good reasons to use TCI. So why should I 
recommend them to stop that?

Stefan

>>> Suggested-by: Daniel Berrangé <berrange@redhat.com>
>>> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
>>> ---
>>>   meson.build | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/meson.build b/meson.build
>>> index 16b2560e7e7..f675c54e636 100644
>>> --- a/meson.build
>>> +++ b/meson.build
>>> @@ -228,6 +228,13 @@
>>>       else
>>>         error('Unsupported CPU @0@, try --enable-tcg-interpreter'.format(cpu))
>>>       endif
>>> +  elif get_option('tcg_interpreter')
>>> +    warning('Use of the TCG interpretor is not recommended on this host')
>>> +    warning('architecture. There is a native TCG execution backend available')
>>> +    warning('which provides substantially better performance and reliability.')
>>> +    warning('It is strongly recommended to remove the --enable-tcg-interpreter')
>>> +    warning('configuration option on this architecture to use the native')
>>> +    warning('backend.')
Richard Henderson Jan. 25, 2021, 7:02 p.m. UTC | #5
On 1/25/21 8:58 AM, Stefan Weil wrote:
> I have no evidence that TCI is less reliable than TCG, so I would not write that.

It can't pass make check-tcg.


r~
Stefan Weil Jan. 25, 2021, 9:02 p.m. UTC | #6
Am 25.01.21 um 20:02 schrieb Richard Henderson:

> On 1/25/21 8:58 AM, Stefan Weil wrote:
>> I have no evidence that TCI is less reliable than TCG, so I would not write that.
> It can't pass make check-tcg.


Where does it fail? Maybe an expected timeout problem which can be 
solved by increasing the timeouts for TCI?

I have just run a local test of `make check-tcg` with native TCG and 
with TCI and did not see a difference. But I noticed that in both cases 
many tests show "skipped".

Stefan
Richard Henderson Jan. 25, 2021, 10:35 p.m. UTC | #7
On 1/25/21 11:02 AM, Stefan Weil wrote:
> Am 25.01.21 um 20:02 schrieb Richard Henderson:
> 
>> On 1/25/21 8:58 AM, Stefan Weil wrote:
>>> I have no evidence that TCI is less reliable than TCG, so I would not write
>>> that.
>> It can't pass make check-tcg.
> 
> 
> Where does it fail? Maybe an expected timeout problem which can be solved by
> increasing the timeouts for TCI?
> 
> I have just run a local test of `make check-tcg` with native TCG and with TCI
> and did not see a difference. But I noticed that in both cases many tests show
> "skipped".

You need to enable docker or podman for your development, so that you get all
of the cross-compilers.

Then:

  TEST    fcvt on arm
TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
../qemu/tcg/tci.c:614: tcg fatal error
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

  TEST    float_convs on m68k
TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
../qemu/tcg/tci.c:614: tcg fatal error
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

which is of course one of the TODO assertions.
It's positively criminal those still exist in the code.


r~
Daniel P. Berrangé Jan. 26, 2021, 10:34 a.m. UTC | #8
On Mon, Jan 25, 2021 at 09:02:38AM -1000, Richard Henderson wrote:
> On 1/25/21 8:58 AM, Stefan Weil wrote:
> > I have no evidence that TCI is less reliable than TCG, so I would not write that.
> 
> It can't pass make check-tcg.

It also doesn't have anyhere near the same level of automated or manual
testing than TCG does, so I don't think we can claim it is at the same
quality level. 

Regards,
Daniel
Stefan Weil Jan. 26, 2021, 11:40 a.m. UTC | #9
Am 25.01.21 um 23:35 schrieb Richard Henderson:
> On 1/25/21 11:02 AM, Stefan Weil wrote:
>> Am 25.01.21 um 20:02 schrieb Richard Henderson:
>>> On 1/25/21 8:58 AM, Stefan Weil wrote:
>>>> I have no evidence that TCI is less reliable than TCG, so I would not write
>>>> that.
>>> It can't pass make check-tcg.
>> Where does it fail? Maybe an expected timeout problem which can be solved by
>> increasing the timeouts for TCI?
>>
>> I have just run a local test of `make check-tcg` with native TCG and with TCI
>> and did not see a difference. But I noticed that in both cases many tests show
>> "skipped".
> You need to enable docker or podman for your development, so that you get all
> of the cross-compilers.
>
> Then:
>
>    TEST    fcvt on arm
> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
> ../qemu/tcg/tci.c:614: tcg fatal error
> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>
>    TEST    float_convs on m68k
> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
> ../qemu/tcg/tci.c:614: tcg fatal error
> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>
> which is of course one of the TODO assertions.
> It's positively criminal those still exist in the code.


I installed podman and repeated `make check-tcg`. The log file still 
shows 87 lines with "SKIPPED". There is also a gdb core dump, several 
warnings, but nothing related to TCI. Both tests cited above seem to 
work without a problem.

The complete log file is available from 
https://qemu.weilnetz.de/test/check-tcg.txt.

Daniel, regarding your comment: TCI has 100 % test coverage for the 
productive code lines. All code lines which were never tested raise an 
assertion, so can easily be identified (and fixed as soon as there is a 
test case which triggers such an assertion). The known deficits are 
speed, missing TCG opcodes, unimplemented TCG opcodes because of missing 
test cases and missing support for some host architectures.

Stefan
Alex Bennée Jan. 26, 2021, 5:24 p.m. UTC | #10
Stefan Weil <sw@weilnetz.de> writes:

> Am 25.01.21 um 23:35 schrieb Richard Henderson:
>> On 1/25/21 11:02 AM, Stefan Weil wrote:
>>> Am 25.01.21 um 20:02 schrieb Richard Henderson:
>>>> On 1/25/21 8:58 AM, Stefan Weil wrote:
>>>>> I have no evidence that TCI is less reliable than TCG, so I would not write
>>>>> that.
>>>> It can't pass make check-tcg.
>>> Where does it fail? Maybe an expected timeout problem which can be solved by
>>> increasing the timeouts for TCI?
>>>
>>> I have just run a local test of `make check-tcg` with native TCG and with TCI
>>> and did not see a difference. But I noticed that in both cases many tests show
>>> "skipped".
>> You need to enable docker or podman for your development, so that you get all
>> of the cross-compilers.
>>
>> Then:
>>
>>    TEST    fcvt on arm
>> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
>> ../qemu/tcg/tci.c:614: tcg fatal error
>> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>>
>>    TEST    float_convs on m68k
>> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
>> ../qemu/tcg/tci.c:614: tcg fatal error
>> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>>
>> which is of course one of the TODO assertions.
>> It's positively criminal those still exist in the code.
>
>
> I installed podman and repeated `make check-tcg`. The log file still 
> shows 87 lines with "SKIPPED". There is also a gdb core dump, several 
> warnings, but nothing related to TCI. Both tests cited above seem to 
> work without a problem.

I'm going to go out on a limb and guess you have commit:

  23a77b2d18 (build-system: clean up TCG/TCI configury)

which temporarily has the effect of disabling TCI. See

  Subject: Re: [PATCH v4 1/4] configure: Fix --enable-tcg-interpreter
  From: Paolo Bonzini <pbonzini@redhat.com>
  Message-ID: <2b8b6291-b54c-b285-ae38-21f067a8497d@redhat.com>
  Date: Mon, 25 Jan 2021 17:36:42 +0100

with that fix fixed I see the same failures as Richard:

  ./qemu-arm ./tests/tcg/arm-linux-user/fcvt > /dev/null
  TODO ../../tcg/tci.c:614: tcg_qemu_tb_exec()
  ../../tcg/tci.c:614: tcg fatal error
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  fish: “./qemu-arm ./tests/tcg/arm-linu…” terminated by signal SIGSEGV (Address boundary error)

which does raise the question before today when was the last time anyone
attempted to run check-tcg on this?

> The complete log file is available from 
> https://qemu.weilnetz.de/test/check-tcg.txt.
>
> Daniel, regarding your comment: TCI has 100 % test coverage for the 
> productive code lines.

By what tests? The fact you don't hit asserts in your day to day testing
doesn't mean there are features missing that are easily tripped up or
that TCI got it right.

> All code lines which were never tested raise an 
> assertion, so can easily be identified (and fixed as soon as there is a 
> test case which triggers such an assertion). The known deficits are 
> speed, missing TCG opcodes, unimplemented TCG opcodes because of missing 
> test cases and missing support for some host architectures.

Passing check-tcg would be a minimum for me.
Stefan Weil Jan. 26, 2021, 7:44 p.m. UTC | #11
Am 26.01.21 um 18:24 schrieb Alex Bennée:
> I'm going to go out on a limb and guess you have commit:
>
>    23a77b2d18 (build-system: clean up TCG/TCI configury)
>
> which temporarily has the effect of disabling TCI. See
>
>    Subject: Re: [PATCH v4 1/4] configure: Fix --enable-tcg-interpreter
>    From: Paolo Bonzini <pbonzini@redhat.com>
>    Message-ID: <2b8b6291-b54c-b285-ae38-21f067a8497d@redhat.com>
>    Date: Mon, 25 Jan 2021 17:36:42 +0100
>
> with that fix fixed I see the same failures as Richard:
>
>    ./qemu-arm ./tests/tcg/arm-linux-user/fcvt > /dev/null
>    TODO ../../tcg/tci.c:614: tcg_qemu_tb_exec()
>    ../../tcg/tci.c:614: tcg fatal error
>    qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>    fish: “./qemu-arm ./tests/tcg/arm-linu…” terminated by signal SIGSEGV (Address boundary error)
>
> which does raise the question before today when was the last time anyone
> attempted to run check-tcg on this?


Yes, I tested with latest git master and did not notice that 
--enable-tcg-interpreter was broken.

Paolo's patch fixed that.I could reproduce the fatal assertions which 
were all caused by the same missing TCG opcode implementation.

I have sent a trivial patch which adds that implementation and fixes all 
segmentation faults.


>> Daniel, regarding your comment: TCI has 100 % test coverage for the
>> productive code lines.
> By what tests? The fact you don't hit asserts in your day to day testing
> doesn't mean there are features missing that are easily tripped up or
> that TCI got it right.


I was not talking about the TODO assertions. When I wrote TCI, I only 
enabled and included code which was triggered by my testing - that's why 
I said the productive code lines have 100 % test coverage. TODO 
assertions are not productive code, but debug code which were made to 
detect new test cases. They were successful, too, because they were 
triggered by some tests in `make check-tcg`.


>> All code lines which were never tested raise an
>> assertion, so can easily be identified (and fixed as soon as there is a
>> test case which triggers such an assertion). The known deficits are
>> speed, missing TCG opcodes, unimplemented TCG opcodes because of missing
>> test cases and missing support for some host architectures.


As soon as I was aware of the new test cases, adding the missing TCG 
opcode implementation was not difficult.


> Passing check-tcg would be a minimum for me.


It should pass now unless you get timeouts for some tests. Please tell 
me if more TODO assertions are triggered by new tests.

Stefan
Paolo Bonzini Jan. 26, 2021, 8:07 p.m. UTC | #12
On 26/01/21 20:44, Stefan Weil wrote:
> 
> Yes, I tested with latest git master and did not notice that 
> --enable-tcg-interpreter was broken.
> 
> Paolo's patch fixed that.I could reproduce the fatal assertions which 
> were all caused by the same missing TCG opcode implementation.

My patch _broke_ that.  Richard fixed it.

Paolo
Stefan Weil Jan. 26, 2021, 8:10 p.m. UTC | #13
Am 26.01.21 um 21:07 schrieb Paolo Bonzini:

> On 26/01/21 20:44, Stefan Weil wrote:
>>
>> Yes, I tested with latest git master and did not notice that 
>> --enable-tcg-interpreter was broken.
>>
>> Paolo's patch fixed that.I could reproduce the fatal assertions which 
>> were all caused by the same missing TCG opcode implementation.
>
> My patch _broke_ that.  Richard fixed it.
>
> Paolo


Sorry for the confusion in my e-mail. Yes, you are right.

Stefan
Richard Henderson Jan. 26, 2021, 10:39 p.m. UTC | #14
On 1/26/21 9:44 AM, Stefan Weil wrote:
> I was not talking about the TODO assertions. When I wrote TCI, I only enabled
> and included code which was triggered by my testing - that's why I said the
> productive code lines have 100 % test coverage. TODO assertions are not
> productive code, but debug code which were made to detect new test cases. They
> were successful, too, because they were triggered by some tests in `make
> check-tcg`.

The TODO assertions are all bugs.

Any *real* dead code detection should have been done in
tcg/tci/tcg-target.c.inc.  What's interpreted in tcg/tci.c should be exactly
what is produced on the other side, and you are producing more than you are
consuming.

> It should pass now unless you get timeouts for some tests. Please tell me if
> more TODO assertions are triggered by new tests.

        case INDEX_op_ld8s_i32:
            TODO();
            break;

Can be triggered by

target/arm/translate-a64.c:1061:
        tcg_gen_ld8s_i64(tcg_dest, cpu_env, vect_off);
target/arm/translate-a64.c:1090:
        tcg_gen_ld8s_i32(tcg_dest, cpu_env, vect_off);
target/arm/translate.c:1210:
        tcg_gen_ld8s_i32(dest, cpu_env, off);

target/s390x/translate_vx.c.inc:81:
        tcg_gen_ld8s_i64(dst, cpu_env, offs);
target/s390x/translate_vx.c.inc:111:
        tcg_gen_ld8s_i32(dst, cpu_env, offs);

        case INDEX_op_ld16s_i32:
            TODO();
            break;

Can be triggered by

target/arm/translate-a64.c:1064:
        tcg_gen_ld16s_i64(tcg_dest, cpu_env, vect_off);
target/arm/translate-a64.c:1093:
        tcg_gen_ld16s_i32(tcg_dest, cpu_env, vect_off);
target/arm/translate.c:1216:
        tcg_gen_ld16s_i32(dest, cpu_env, off);
target/s390x/translate_vx.c.inc:84:
        tcg_gen_ld16s_i64(dst, cpu_env, offs);
target/s390x/translate_vx.c.inc:114:
        tcg_gen_ld16s_i32(dst, cpu_env, offs);

All of which are target vector instructions.
I'm sure it would be trivial to whip up test cases for them, but I don't see
that as my job.

Please maintain this code properly or give it up.


r~
Stefan Weil Jan. 27, 2021, 6:53 a.m. UTC | #15
Am 26.01.21 um 23:39 schrieb Richard Henderson:

> On 1/26/21 9:44 AM, Stefan Weil wrote:
>> I was not talking about the TODO assertions. When I wrote TCI, I only enabled
>> and included code which was triggered by my testing - that's why I said the
>> productive code lines have 100 % test coverage. TODO assertions are not
>> productive code, but debug code which were made to detect new test cases. They
>> were successful, too, because they were triggered by some tests in `make
>> check-tcg`.
> The TODO assertions are all bugs.
>
> Any *real* dead code detection should have been done in
> tcg/tci/tcg-target.c.inc.  What's interpreted in tcg/tci.c should be exactly
> what is produced on the other side, and you are producing more than you are
> consuming.


Unless the TCG opcodes in tcg/tci/tcg-target.c.inc are used in real-live 
scenarios, they are dead code, too.

Writing a test case which produces them directly (not for some real 
architecture) is not a real-live scenario.

And the remaining TODO assertions are a good indicator that the current 
tests are incomplete for native TCG because they obviously don't cover 
all TCG opcodes.

Stefan
Daniel P. Berrangé Jan. 27, 2021, 10:02 a.m. UTC | #16
On Tue, Jan 26, 2021 at 05:24:10PM +0000, Alex Bennée wrote:
> 
> Stefan Weil <sw@weilnetz.de> writes:
> 
> > Am 25.01.21 um 23:35 schrieb Richard Henderson:
> >> On 1/25/21 11:02 AM, Stefan Weil wrote:
> >>> Am 25.01.21 um 20:02 schrieb Richard Henderson:
> >>>> On 1/25/21 8:58 AM, Stefan Weil wrote:
> >>>>> I have no evidence that TCI is less reliable than TCG, so I would not write
> >>>>> that.
> >>>> It can't pass make check-tcg.
> >>> Where does it fail? Maybe an expected timeout problem which can be solved by
> >>> increasing the timeouts for TCI?
> >>>
> >>> I have just run a local test of `make check-tcg` with native TCG and with TCI
> >>> and did not see a difference. But I noticed that in both cases many tests show
> >>> "skipped".
> >> You need to enable docker or podman for your development, so that you get all
> >> of the cross-compilers.
> >>
> >> Then:
> >>
> >>    TEST    fcvt on arm
> >> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
> >> ../qemu/tcg/tci.c:614: tcg fatal error
> >> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
> >>
> >>    TEST    float_convs on m68k
> >> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
> >> ../qemu/tcg/tci.c:614: tcg fatal error
> >> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
> >>
> >> which is of course one of the TODO assertions.
> >> It's positively criminal those still exist in the code.
> >
> >
> > I installed podman and repeated `make check-tcg`. The log file still 
> > shows 87 lines with "SKIPPED". There is also a gdb core dump, several 
> > warnings, but nothing related to TCI. Both tests cited above seem to 
> > work without a problem.
> 
> I'm going to go out on a limb and guess you have commit:
> 
>   23a77b2d18 (build-system: clean up TCG/TCI configury)
> 
> which temporarily has the effect of disabling TCI. See
> 
>   Subject: Re: [PATCH v4 1/4] configure: Fix --enable-tcg-interpreter
>   From: Paolo Bonzini <pbonzini@redhat.com>
>   Message-ID: <2b8b6291-b54c-b285-ae38-21f067a8497d@redhat.com>
>   Date: Mon, 25 Jan 2021 17:36:42 +0100
> 
> with that fix fixed I see the same failures as Richard:
> 
>   ./qemu-arm ./tests/tcg/arm-linux-user/fcvt > /dev/null
>   TODO ../../tcg/tci.c:614: tcg_qemu_tb_exec()
>   ../../tcg/tci.c:614: tcg fatal error
>   qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>   fish: “./qemu-arm ./tests/tcg/arm-linu…” terminated by signal SIGSEGV (Address boundary error)
> 
> which does raise the question before today when was the last time anyone
> attempted to run check-tcg on this?
> 
> > The complete log file is available from 
> > https://qemu.weilnetz.de/test/check-tcg.txt.
> >
> > Daniel, regarding your comment: TCI has 100 % test coverage for the 
> > productive code lines.
> 
> By what tests? The fact you don't hit asserts in your day to day testing
> doesn't mean there are features missing that are easily tripped up or
> that TCI got it right.
> 
> > All code lines which were never tested raise an 
> > assertion, so can easily be identified (and fixed as soon as there is a 
> > test case which triggers such an assertion). The known deficits are 
> > speed, missing TCG opcodes, unimplemented TCG opcodes because of missing 
> > test cases and missing support for some host architectures.
> 
> Passing check-tcg would be a minimum for me.

Passing check-tcg *in gitlab CI* would be the minimum to consider
it on a par with TCG.

The lack of automated GitLab CI for TCI is a reason my proposed wording
described TCI as less reliable than native TCG. We can't claim it has
equivalent reliability unless we have equiv automated testing of TCI.

Regards,
Daniel
Alex Bennée Jan. 27, 2021, 12:34 p.m. UTC | #17
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Jan 26, 2021 at 05:24:10PM +0000, Alex Bennée wrote:
>> 
>> Stefan Weil <sw@weilnetz.de> writes:
>> 
>> > Am 25.01.21 um 23:35 schrieb Richard Henderson:
>> >> On 1/25/21 11:02 AM, Stefan Weil wrote:
>> >>> Am 25.01.21 um 20:02 schrieb Richard Henderson:
>> >>>> On 1/25/21 8:58 AM, Stefan Weil wrote:
>> >>>>> I have no evidence that TCI is less reliable than TCG, so I would not write
>> >>>>> that.
>> >>>> It can't pass make check-tcg.
>> >>> Where does it fail? Maybe an expected timeout problem which can be solved by
>> >>> increasing the timeouts for TCI?
>> >>>
>> >>> I have just run a local test of `make check-tcg` with native TCG and with TCI
>> >>> and did not see a difference. But I noticed that in both cases many tests show
>> >>> "skipped".
>> >> You need to enable docker or podman for your development, so that you get all
>> >> of the cross-compilers.
>> >>
>> >> Then:
>> >>
>> >>    TEST    fcvt on arm
>> >> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
>> >> ../qemu/tcg/tci.c:614: tcg fatal error
>> >> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>> >>
>> >>    TEST    float_convs on m68k
>> >> TODO ../qemu/tcg/tci.c:614: tcg_qemu_tb_exec()
>> >> ../qemu/tcg/tci.c:614: tcg fatal error
>> >> qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>> >>
>> >> which is of course one of the TODO assertions.
>> >> It's positively criminal those still exist in the code.
>> >
>> >
>> > I installed podman and repeated `make check-tcg`. The log file still 
>> > shows 87 lines with "SKIPPED". There is also a gdb core dump, several 
>> > warnings, but nothing related to TCI. Both tests cited above seem to 
>> > work without a problem.
>> 
>> I'm going to go out on a limb and guess you have commit:
>> 
>>   23a77b2d18 (build-system: clean up TCG/TCI configury)
>> 
>> which temporarily has the effect of disabling TCI. See
>> 
>>   Subject: Re: [PATCH v4 1/4] configure: Fix --enable-tcg-interpreter
>>   From: Paolo Bonzini <pbonzini@redhat.com>
>>   Message-ID: <2b8b6291-b54c-b285-ae38-21f067a8497d@redhat.com>
>>   Date: Mon, 25 Jan 2021 17:36:42 +0100
>> 
>> with that fix fixed I see the same failures as Richard:
>> 
>>   ./qemu-arm ./tests/tcg/arm-linux-user/fcvt > /dev/null
>>   TODO ../../tcg/tci.c:614: tcg_qemu_tb_exec()
>>   ../../tcg/tci.c:614: tcg fatal error
>>   qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>>   fish: “./qemu-arm ./tests/tcg/arm-linu…” terminated by signal SIGSEGV (Address boundary error)
>> 
>> which does raise the question before today when was the last time anyone
>> attempted to run check-tcg on this?
>> 
>> > The complete log file is available from 
>> > https://qemu.weilnetz.de/test/check-tcg.txt.
>> >
>> > Daniel, regarding your comment: TCI has 100 % test coverage for the 
>> > productive code lines.
>> 
>> By what tests? The fact you don't hit asserts in your day to day testing
>> doesn't mean there are features missing that are easily tripped up or
>> that TCI got it right.
>> 
>> > All code lines which were never tested raise an 
>> > assertion, so can easily be identified (and fixed as soon as there is a 
>> > test case which triggers such an assertion). The known deficits are 
>> > speed, missing TCG opcodes, unimplemented TCG opcodes because of missing 
>> > test cases and missing support for some host architectures.
>> 
>> Passing check-tcg would be a minimum for me.
>
> Passing check-tcg *in gitlab CI* would be the minimum to consider
> it on a par with TCG.
>
> The lack of automated GitLab CI for TCI is a reason my proposed wording
> described TCI as less reliable than native TCG. We can't claim it has
> equivalent reliability unless we have equiv automated testing of TCI.

I should point out that check-tcg is hardly a comprehensive test suite.
Most of our instruction testing for example tends to be done with RISU.
Any program that attempts to use vector instructions is likely to come a
cropper with TCI. I guess we don't even attempt to run check-acceptance
due to speed issues but it would be interesting to see how far it gets.

>
> Regards,
> Daniel
Richard Henderson Jan. 27, 2021, 5:19 p.m. UTC | #18
On 1/26/21 8:53 PM, Stefan Weil wrote:
> And the remaining TODO assertions are a good indicator that the current tests
> are incomplete for native TCG because they obviously don't cover all TCG opcodes.

If the symbol appears in target/, then the opcode can be produced.  I've just
shown you how for 2 of them, via 2 different target architectures.


r~
Alex Bennée Jan. 27, 2021, 7:52 p.m. UTC | #19
Stefan Weil <sw@weilnetz.de> writes:

> Am 26.01.21 um 23:39 schrieb Richard Henderson:
>
>> On 1/26/21 9:44 AM, Stefan Weil wrote:
>>> I was not talking about the TODO assertions. When I wrote TCI, I only enabled
>>> and included code which was triggered by my testing - that's why I said the
>>> productive code lines have 100 % test coverage. TODO assertions are not
>>> productive code, but debug code which were made to detect new test cases. They
>>> were successful, too, because they were triggered by some tests in `make
>>> check-tcg`.
>> The TODO assertions are all bugs.
>>
>> Any *real* dead code detection should have been done in
>> tcg/tci/tcg-target.c.inc.  What's interpreted in tcg/tci.c should be exactly
>> what is produced on the other side, and you are producing more than you are
>> consuming.
>
>
> Unless the TCG opcodes in tcg/tci/tcg-target.c.inc are used in real-live 
> scenarios, they are dead code, too.

For example - debian-buster (arm64) running ffmpeg:

  alex.bennee@8cd150a4b35d:~/lsrc/qemu.git/builds/all.tci$ ./qemu-aarch64 /usr/bin/ffmpeg -i theora.mkv theora.webm
  TODO ../../tcg/tci.c:882: tcg_qemu_tb_exec()
  ../../tcg/tci.c:882: tcg fatal error
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  Segmentation fault (core dumped)

> Writing a test case which produces them directly (not for some real 
> architecture) is not a real-live scenario.
>
> And the remaining TODO assertions are a good indicator that the current 
> tests are incomplete for native TCG because they obviously don't cover 
> all TCG opcodes.

That's because check-tcg isn't a comprehensive test suite and expecting
it to be misses the point of it. It was added to make it easy to add new
test cases and remove some of the burden off maintainers having their
own zoo of test binaries. It has slowly grown as directed test cases
were written while bug hunting and sometimes when new features where
added. It will never be a comprehensive exercising of the CPU emulation
although some architectures have more coverage than others. For example
MIPs has a bunch of ISA level tests as part of check-tcg but most of the
ARM ISA validation is done externally using the RISU random instruction
testing tool.

Besides you've just argued writing a test case that targets missing
functionality in TCI would somehow be cheating as it's not a "real-live"
scenario.

I don't mind either way - the fact that TCI is useful to people is cool
and more power to them. But lets not pretend it is a fully functional
and maintained backend because it has obviously got some major holes. If
it ends up being a drag on efforts to maintain and improve the TCG then
we have to question why we are keeping it in. Being able to run
emulation on esoteric hardware without a real backend is a party trick
at best. The other use-cases that have been mentioned could be solved
with investing some effort in the rest of the TCG code.
Stefan Weil Jan. 27, 2021, 8:49 p.m. UTC | #20
Am 27.01.21 um 20:52 schrieb Alex Bennée:

> For example - debian-buster (arm64) running ffmpeg:
>
>    alex.bennee@8cd150a4b35d:~/lsrc/qemu.git/builds/all.tci$ ./qemu-aarch64 /usr/bin/ffmpeg -i theora.mkv theora.webm
>    TODO ../../tcg/tci.c:882: tcg_qemu_tb_exec()
>    ../../tcg/tci.c:882: tcg fatal error
>    qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>    Segmentation fault (core dumped)


Thanks. All I tried to say is that I prefer to replace those TODO 
statements by working code as soon as there was a case which triggers 
them. Most of those TODO statements are very easy to implement, so 
anyone can add them when he/she detects a missing one. If I get 
information about a scenario which triggers a missing TODO, I'll fix 
that of course. I just don't want to add that missing code blindly.

Using `make check-tcg` helped finding and fixing one of them, future 
improved CI checks can find more, and so can examples like the one 
above. The error message tci.c:882 is INDEX_op_ld8s_i64 
(https://github.com/qemu/qemu/blob/master/tcg/tci.c#L882). The missing 
code is nearly identical to the existing code for INDEX_op_ld8u_i64, but 
with *(int8_t *) instead of *(uint8_t *), so maybe you can try that and 
confirm whether it fixes the reported problem. Otherwise I'll try to 
reproduce it with any mkv file.

I recently tried running tesseract with qemu-x86_64 because I had 
expected that it might trigger some unimplemented TCG opcodes. Instead 
it showed a general problem for native TCG: qemu-x86_64 allocates too 
much memory for tesseract and gets killed by the Linux kernel OOM handler.

Regards,

Stefan
Alex Bennée Jan. 27, 2021, 9:47 p.m. UTC | #21
Stefan Weil <sw@weilnetz.de> writes:

> Am 27.01.21 um 20:52 schrieb Alex Bennée:
>
>> For example - debian-buster (arm64) running ffmpeg:
>>
>>    alex.bennee@8cd150a4b35d:~/lsrc/qemu.git/builds/all.tci$ ./qemu-aarch64 /usr/bin/ffmpeg -i theora.mkv theora.webm
>>    TODO ../../tcg/tci.c:882: tcg_qemu_tb_exec()
>>    ../../tcg/tci.c:882: tcg fatal error
>>    qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>>    Segmentation fault (core dumped)
>
>
> Thanks. All I tried to say is that I prefer to replace those TODO 
> statements by working code as soon as there was a case which triggers 
> them. Most of those TODO statements are very easy to implement, so 
> anyone can add them when he/she detects a missing one. If I get 
> information about a scenario which triggers a missing TODO, I'll fix 
> that of course. I just don't want to add that missing code blindly.

Your just going to end up playing wack-a-mole:

  TODO ../../tcg/tci.c:620: tcg_qemu_tb_exec()me=00:00:00.00 bitrate=N/A speed=   0x
  ../../tcg/tci.c:620: tcg fatal error
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  Segmentation fault (core dumped)

> Using `make check-tcg` helped finding and fixing one of them, future 
> improved CI checks can find more, and so can examples like the one 
> above. The error message tci.c:882 is INDEX_op_ld8s_i64 
> (https://github.com/qemu/qemu/blob/master/tcg/tci.c#L882). The missing 
> code is nearly identical to the existing code for INDEX_op_ld8u_i64, but 
> with *(int8_t *) instead of *(uint8_t *), so maybe you can try that and 
> confirm whether it fixes the reported problem. Otherwise I'll try to 
> reproduce it with any mkv file.

ffmpeg is a good application for working out the SIMD code because it
features quite a lot of optimised code for each architecture.

> I recently tried running tesseract with qemu-x86_64 because I had 
> expected that it might trigger some unimplemented TCG opcodes.

qemu-x86-64 is a poor choice as a relatively under maintained front-end
doesn't emulate a particularly new CPU or take advantage of the new TCG
features. ARM64 is pretty good because the default cpu for linux-user is
CPU max which a) enables all ISA features we have and b) exposes them
fairly easily to guest detection routines which probe feature registers.

> Instead 
> it showed a general problem for native TCG: qemu-x86_64 allocates too 
> much memory for tesseract and gets killed by the Linux kernel OOM
> handler.

Do you have a command line? That sounds like something that should be
fixed.

>
> Regards,
>
> Stefan
Stefan Weil Jan. 28, 2021, 2:49 a.m. UTC | #22
Am 27.01.21 um 22:47 schrieb Alex Bennée:

> Your just going to end up playing wack-a-mole:
>
>    TODO ../../tcg/tci.c:620: tcg_qemu_tb_exec()me=00:00:00.00 bitrate=N/A speed=   0x
>    ../../tcg/tci.c:620: tcg fatal error
>    qemu: uncaught target signal 11 (Segmentation fault) - core dumped
>    Segmentation fault (core dumped)


That's INDEX_op_ld16s_i32. Only five levels left in the play as soon as 
that is implemented, too.

Thanks,

Stefan
Stefan Weil Jan. 28, 2021, 6:51 a.m. UTC | #23
Am 27.01.21 um 22:47 schrieb Alex Bennée:

> Stefan Weil<sw@weilnetz.de>  writes:
>> I recently tried running tesseract with qemu-x86_64 because I had
>> expected that it might trigger some unimplemented TCG opcodes.
> qemu-x86-64 is a poor choice as a relatively under maintained front-end
> doesn't emulate a particularly new CPU or take advantage of the new TCG
> features. ARM64 is pretty good because the default cpu for linux-user is
> CPU max which a) enables all ISA features we have and b) exposes them
> fairly easily to guest detection routines which probe feature registers.
>
>> Instead
>> it showed a general problem for native TCG: qemu-x86_64 allocates too
>> much memory for tesseract and gets killed by the Linux kernel OOM
>> handler.
> Do you have a command line? That sounds like something that should be
> fixed.


The problem occurred with a locally built tesseract, but I now found 
that it is more general.

Any program which was compiled with address sanitizer uses huge virtual 
memory (TB) right at the start. QEMU user mode tries to allocate that 
memory until it is killed by the Linux kernel OOM handler.

A simple hello program compiled with "gcc -fsanitize=address hello.c" is 
sufficient to show the problem. Just run it with "qemu-x86_64 a.out".

I did not test but expect the same problem for other architectures, too, 
unless their VM is more limited.

Regards,

Stefan
Richard Henderson Jan. 28, 2021, 8:29 a.m. UTC | #24
On 1/27/21 8:51 PM, Stefan Weil wrote:
> The problem occurred with a locally built tesseract, but I now found that it is
> more general.
> 
> Any program which was compiled with address sanitizer uses huge virtual memory
> (TB) right at the start. QEMU user mode tries to allocate that memory until it
> is killed by the Linux kernel OOM handler.

Yes, we have an open bug for that.  It is not a trivial problem.

https://bugs.launchpad.net/qemu/+bug/1898011


r~
diff mbox series

Patch

diff --git a/meson.build b/meson.build
index 16b2560e7e7..f675c54e636 100644
--- a/meson.build
+++ b/meson.build
@@ -228,6 +228,13 @@ 
     else
       error('Unsupported CPU @0@, try --enable-tcg-interpreter'.format(cpu))
     endif
+  elif get_option('tcg_interpreter')
+    warning('Use of the TCG interpretor is not recommended on this host')
+    warning('architecture. There is a native TCG execution backend available')
+    warning('which provides substantially better performance and reliability.')
+    warning('It is strongly recommended to remove the --enable-tcg-interpreter')
+    warning('configuration option on this architecture to use the native')
+    warning('backend.')
   endif
   if get_option('tcg_interpreter')
     tcg_arch = 'tci'