diff mbox series

tb_flush() calls causing long Windows XP boot times

Message ID BCB8773B-FC54-4C25-9B60-92C263165D38@gmail.com (mailing list archive)
State New, archived
Headers show
Series tb_flush() calls causing long Windows XP boot times | expand

Commit Message

Programmingkid June 10, 2021, 12:59 p.m. UTC
Hi Richard,

There is a function called breakpoint_invalidate() in cpu.c that calls a function called tb_flush(). I have determined that this call is being made over 200,000 times when Windows XP boots. Disabling this function makes Windows XP boot way faster than before. The time went down from around 3 minutes to 20 seconds when I applied the patch below. 

After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:

Mac OS 10.8 in qemu-system-x86_64
Windows 7 in qemu-system-x86_64
Windows XP in qemu-system-i386
Mac OS 10.4 in qemu-system-ppc

I would be happy if the patch below was accepted but I would like to know your thoughts.

Thank you.


---
 cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Peter Maydell June 10, 2021, 1:14 p.m. UTC | #1
On Thu, 10 Jun 2021 at 14:02, Programmingkid <programmingkidx@gmail.com> wrote:
>
> Hi Richard,
>
> There is a function called breakpoint_invalidate() in cpu.c that calls a function called tb_flush(). I have determined that this call is being made over 200,000 times when Windows XP boots. Disabling this function makes Windows XP boot way faster than before. The time went down from around 3 minutes to 20 seconds when I applied the patch below.
>
> After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:
>
> Mac OS 10.8 in qemu-system-x86_64
> Windows 7 in qemu-system-x86_64
> Windows XP in qemu-system-i386
> Mac OS 10.4 in qemu-system-ppc
>
> I would be happy if the patch below was accepted but I would like to know your thoughts.

>  cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/cpu.c b/cpu.c
> index bfbe5a66f9..297c2e4281 100644
> --- a/cpu.c
> +++ b/cpu.c
> @@ -253,7 +253,7 @@ static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>       * Flush the whole TB cache to force re-translation of such TBs.
>       * This is heavyweight, but we're debugging anyway.
>       */
> -    tb_flush(cpu);
> +    /* tb_flush(cpu); */
>  }
>  #endif

The patch is clearly wrong -- this function is called when a CPU breakpoint
is added or removed, and we *must* drop generated code which either
(a) includes code to take the breakpoint exception and now should not
or (b) doesn't include code to take the breakpoint exception and now should.
Otherwise we will incorrectly take/not take a breakpoint exception when
that stale code is executed.

As the comment notes, the assumption is that we won't be adding and
removing breakpoints except when we're debugging and therefore
performance is not critical. Windows XP is clearly doing something
we weren't expecting, so we should ideally have a look at whether
we can be a bit more efficient about not throwing the whole TB
cache away.

thanks
-- PMM
Mark Cave-Ayland June 10, 2021, 1:24 p.m. UTC | #2
On 10/06/2021 14:14, Peter Maydell wrote:

> On Thu, 10 Jun 2021 at 14:02, Programmingkid <programmingkidx@gmail.com> wrote:
>>
>> Hi Richard,
>>
>> There is a function called breakpoint_invalidate() in cpu.c that calls a function called tb_flush(). I have determined that this call is being made over 200,000 times when Windows XP boots. Disabling this function makes Windows XP boot way faster than before. The time went down from around 3 minutes to 20 seconds when I applied the patch below.
>>
>> After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:
>>
>> Mac OS 10.8 in qemu-system-x86_64
>> Windows 7 in qemu-system-x86_64
>> Windows XP in qemu-system-i386
>> Mac OS 10.4 in qemu-system-ppc
>>
>> I would be happy if the patch below was accepted but I would like to know your thoughts.
> 
>>   cpu.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/cpu.c b/cpu.c
>> index bfbe5a66f9..297c2e4281 100644
>> --- a/cpu.c
>> +++ b/cpu.c
>> @@ -253,7 +253,7 @@ static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>>        * Flush the whole TB cache to force re-translation of such TBs.
>>        * This is heavyweight, but we're debugging anyway.
>>        */
>> -    tb_flush(cpu);
>> +    /* tb_flush(cpu); */
>>   }
>>   #endif
> 
> The patch is clearly wrong -- this function is called when a CPU breakpoint
> is added or removed, and we *must* drop generated code which either
> (a) includes code to take the breakpoint exception and now should not
> or (b) doesn't include code to take the breakpoint exception and now should.
> Otherwise we will incorrectly take/not take a breakpoint exception when
> that stale code is executed.
> 
> As the comment notes, the assumption is that we won't be adding and
> removing breakpoints except when we're debugging and therefore
> performance is not critical. Windows XP is clearly doing something
> we weren't expecting, so we should ideally have a look at whether
> we can be a bit more efficient about not throwing the whole TB
> cache away.

FWIW this was reported a while back on Launchpad as 
https://bugs.launchpad.net/qemu/+bug/1883593 where the performance regression was 
traced back to:

commit b55f54bc965607c45b5010a107a792ba333ba654
Author: Max Filippov <jcmvbkbc@gmail.com>
Date:   Wed Nov 27 14:06:01 2019 -0800

     exec: flush CPU TB cache in breakpoint_invalidate

     When a breakpoint is inserted at location for which there's currently no
     virtual to physical translation no action is taken on CPU TB cache. If a
     TB for that virtual address already exists but is not visible ATM the
     breakpoint won't be hit next time an instruction at that address will be
     executed.

     Flush entire CPU TB cache in breakpoint_invalidate to force
     re-translation of all TBs for the breakpoint address.

     This change fixes the following scenario:
     - linux user application is running
     - a breakpoint is inserted from QEMU gdbstub for a user address that is
       not currently present in the target CPU TLB
     - an instruction at that address is executed, but the external debugger
       doesn't get control.

     Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
     Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
     Message-Id: <20191127220602.10827-2-jcmvbkbc@gmail.com>
     Signed-off-by: Richard Henderson <richard.henderson@linaro.org>


Perhaps Windows XP is constantly executing some kind of breakpoint instruction or 
updating some CPU debug registers during boot?


ATB,

Mark.
Programmingkid June 10, 2021, 1:38 p.m. UTC | #3
> On Jun 10, 2021, at 9:14 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Thu, 10 Jun 2021 at 14:02, Programmingkid <programmingkidx@gmail.com> wrote:
>> 
>> Hi Richard,
>> 
>> There is a function called breakpoint_invalidate() in cpu.c that calls a function called tb_flush(). I have determined that this call is being made over 200,000 times when Windows XP boots. Disabling this function makes Windows XP boot way faster than before. The time went down from around 3 minutes to 20 seconds when I applied the patch below.
>> 
>> After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:
>> 
>> Mac OS 10.8 in qemu-system-x86_64
>> Windows 7 in qemu-system-x86_64
>> Windows XP in qemu-system-i386
>> Mac OS 10.4 in qemu-system-ppc
>> 
>> I would be happy if the patch below was accepted but I would like to know your thoughts.
> 
>> cpu.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/cpu.c b/cpu.c
>> index bfbe5a66f9..297c2e4281 100644
>> --- a/cpu.c
>> +++ b/cpu.c
>> @@ -253,7 +253,7 @@ static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>>      * Flush the whole TB cache to force re-translation of such TBs.
>>      * This is heavyweight, but we're debugging anyway.
>>      */
>> -    tb_flush(cpu);
>> +    /* tb_flush(cpu); */
>> }
>> #endif
> 
> The patch is clearly wrong -- this function is called when a CPU breakpoint
> is added or removed, and we *must* drop generated code which either
> (a) includes code to take the breakpoint exception and now should not
> or (b) doesn't include code to take the breakpoint exception and now should.
> Otherwise we will incorrectly take/not take a breakpoint exception when
> that stale code is executed.
> 
> As the comment notes, the assumption is that we won't be adding and
> removing breakpoints except when we're debugging and therefore
> performance is not critical. Windows XP is clearly doing something
> we weren't expecting, so we should ideally have a look at whether
> we can be a bit more efficient about not throwing the whole TB
> cache away.
> 
> thanks
> -- PMM

Thank you for the information. I think there may be additional conditions that may need to be considered before calling tb_flush().
Alex Bennée June 11, 2021, 11:24 a.m. UTC | #4
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 10/06/2021 14:14, Peter Maydell wrote:
>
>> On Thu, 10 Jun 2021 at 14:02, Programmingkid <programmingkidx@gmail.com> wrote:
>>>
>>> Hi Richard,
>>>
>>> There is a function called breakpoint_invalidate() in cpu.c that
>>> calls a function called tb_flush(). I have determined that this
>>> call is being made over 200,000 times when Windows XP boots.
>>> Disabling this function makes Windows XP boot way faster than
>>> before. The time went down from around 3 minutes to 20 seconds when
>>> I applied the patch below.
>>>
>>> After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:
>>>
>>> Mac OS 10.8 in qemu-system-x86_64
>>> Windows 7 in qemu-system-x86_64
>>> Windows XP in qemu-system-i386
>>> Mac OS 10.4 in qemu-system-ppc
>>>
>>> I would be happy if the patch below was accepted but I would like to know your thoughts.
>> 
>>>   cpu.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/cpu.c b/cpu.c
>>> index bfbe5a66f9..297c2e4281 100644
>>> --- a/cpu.c
>>> +++ b/cpu.c
>>> @@ -253,7 +253,7 @@ static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>>>        * Flush the whole TB cache to force re-translation of such TBs.
>>>        * This is heavyweight, but we're debugging anyway.
>>>        */
>>> -    tb_flush(cpu);
>>> +    /* tb_flush(cpu); */
>>>   }
>>>   #endif
>> The patch is clearly wrong -- this function is called when a CPU
>> breakpoint
>> is added or removed, and we *must* drop generated code which either
>> (a) includes code to take the breakpoint exception and now should not
>> or (b) doesn't include code to take the breakpoint exception and now should.
>> Otherwise we will incorrectly take/not take a breakpoint exception when
>> that stale code is executed.
>> As the comment notes, the assumption is that we won't be adding and
>> removing breakpoints except when we're debugging and therefore
>> performance is not critical. Windows XP is clearly doing something
>> we weren't expecting, so we should ideally have a look at whether
>> we can be a bit more efficient about not throwing the whole TB
>> cache away.
>
> FWIW this was reported a while back on Launchpad as
> https://bugs.launchpad.net/qemu/+bug/1883593 where the performance
> regression was traced back to:
>
> commit b55f54bc965607c45b5010a107a792ba333ba654
> Author: Max Filippov <jcmvbkbc@gmail.com>
> Date:   Wed Nov 27 14:06:01 2019 -0800
>
>     exec: flush CPU TB cache in breakpoint_invalidate
>
>     When a breakpoint is inserted at location for which there's currently no
>     virtual to physical translation no action is taken on CPU TB cache. If a
>     TB for that virtual address already exists but is not visible ATM the
>     breakpoint won't be hit next time an instruction at that address will be
>     executed.
>
>     Flush entire CPU TB cache in breakpoint_invalidate to force
>     re-translation of all TBs for the breakpoint address.
>
>     This change fixes the following scenario:
>     - linux user application is running
>     - a breakpoint is inserted from QEMU gdbstub for a user address that is
>       not currently present in the target CPU TLB
>     - an instruction at that address is executed, but the external debugger
>       doesn't get control.
>
>     Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
>     Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
>     Message-Id: <20191127220602.10827-2-jcmvbkbc@gmail.com>
>     Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

This was a reversion of a reversion (see 406bc339b0 and a9353fe89). So
we've bounced between solutions several times. Fundamentally if it gets
tricky to ensure your translated state matches the actual machine state
the easiest option is to throw everything away and start again.

> Perhaps Windows XP is constantly executing some kind of breakpoint
> instruction or updating some CPU debug registers during boot?

It would be useful to identify what exactly is triggering the changes
here. If it's old fashioned breakpoint instructions being inserted into
memory we will need to ensure our invalidating of old translations is
rock solid. If we are updating our internal breakpoint/watchpoint
tracking as a result of the guest messing with debug registers maybe we
can do something a bit more finessed?

>
>
> ATB,
>
> Mark.
Programmingkid June 11, 2021, 3:01 p.m. UTC | #5
> On Jun 11, 2021, at 7:24 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> 
> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:
> 
>> On 10/06/2021 14:14, Peter Maydell wrote:
>> 
>>> On Thu, 10 Jun 2021 at 14:02, Programmingkid <programmingkidx@gmail.com> wrote:
>>>> 
>>>> Hi Richard,
>>>> 
>>>> There is a function called breakpoint_invalidate() in cpu.c that
>>>> calls a function called tb_flush(). I have determined that this
>>>> call is being made over 200,000 times when Windows XP boots.
>>>> Disabling this function makes Windows XP boot way faster than
>>>> before. The time went down from around 3 minutes to 20 seconds when
>>>> I applied the patch below.
>>>> 
>>>> After I applied the patch I ran several tests in my VM's to see if anything broke. I could not find any problems. Here is the list my VM's I tested:
>>>> 
>>>> Mac OS 10.8 in qemu-system-x86_64
>>>> Windows 7 in qemu-system-x86_64
>>>> Windows XP in qemu-system-i386
>>>> Mac OS 10.4 in qemu-system-ppc
>>>> 
>>>> I would be happy if the patch below was accepted but I would like to know your thoughts.
>>> 
>>>>  cpu.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>> 
>>>> diff --git a/cpu.c b/cpu.c
>>>> index bfbe5a66f9..297c2e4281 100644
>>>> --- a/cpu.c
>>>> +++ b/cpu.c
>>>> @@ -253,7 +253,7 @@ static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>>>>       * Flush the whole TB cache to force re-translation of such TBs.
>>>>       * This is heavyweight, but we're debugging anyway.
>>>>       */
>>>> -    tb_flush(cpu);
>>>> +    /* tb_flush(cpu); */
>>>>  }
>>>>  #endif
>>> The patch is clearly wrong -- this function is called when a CPU
>>> breakpoint
>>> is added or removed, and we *must* drop generated code which either
>>> (a) includes code to take the breakpoint exception and now should not
>>> or (b) doesn't include code to take the breakpoint exception and now should.
>>> Otherwise we will incorrectly take/not take a breakpoint exception when
>>> that stale code is executed.
>>> As the comment notes, the assumption is that we won't be adding and
>>> removing breakpoints except when we're debugging and therefore
>>> performance is not critical. Windows XP is clearly doing something
>>> we weren't expecting, so we should ideally have a look at whether
>>> we can be a bit more efficient about not throwing the whole TB
>>> cache away.
>> 
>> FWIW this was reported a while back on Launchpad as
>> https://bugs.launchpad.net/qemu/+bug/1883593 where the performance
>> regression was traced back to:
>> 
>> commit b55f54bc965607c45b5010a107a792ba333ba654
>> Author: Max Filippov <jcmvbkbc@gmail.com>
>> Date:   Wed Nov 27 14:06:01 2019 -0800
>> 
>>    exec: flush CPU TB cache in breakpoint_invalidate
>> 
>>    When a breakpoint is inserted at location for which there's currently no
>>    virtual to physical translation no action is taken on CPU TB cache. If a
>>    TB for that virtual address already exists but is not visible ATM the
>>    breakpoint won't be hit next time an instruction at that address will be
>>    executed.
>> 
>>    Flush entire CPU TB cache in breakpoint_invalidate to force
>>    re-translation of all TBs for the breakpoint address.
>> 
>>    This change fixes the following scenario:
>>    - linux user application is running
>>    - a breakpoint is inserted from QEMU gdbstub for a user address that is
>>      not currently present in the target CPU TLB
>>    - an instruction at that address is executed, but the external debugger
>>      doesn't get control.
>> 
>>    Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
>>    Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
>>    Message-Id: <20191127220602.10827-2-jcmvbkbc@gmail.com>
>>    Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> 
> This was a reversion of a reversion (see 406bc339b0 and a9353fe89). So
> we've bounced between solutions several times. Fundamentally if it gets
> tricky to ensure your translated state matches the actual machine state
> the easiest option is to throw everything away and start again.
> 
>> Perhaps Windows XP is constantly executing some kind of breakpoint
>> instruction or updating some CPU debug registers during boot?
> 
> It would be useful to identify what exactly is triggering the changes
> here. If it's old fashioned breakpoint instructions being inserted into
> memory we will need to ensure our invalidating of old translations is
> rock solid. If we are updating our internal breakpoint/watchpoint
> tracking as a result of the guest messing with debug registers maybe we
> can do something a bit more finessed?
> 
>> 
>> 
>> ATB,
>> 
>> Mark.
> 
> 
> -- 
> Alex Bennée

Hello Alex,

The good news is the source code to Windows XP is available online: https://github.com/cryptoAlgorithm/nt5src
Paolo Bonzini June 11, 2021, 5:13 p.m. UTC | #6
On 11/06/21 17:01, Programmingkid wrote:
> Hello Alex,
> 
> The good news is the source code to Windows XP is available online:https://github.com/cryptoAlgorithm/nt5src

It's leaked, so I doubt anybody who's paid to work on Linux or QEMU 
would touch that with a ten-foot pole.

Paolo
Alex Bennée June 11, 2021, 6:22 p.m. UTC | #7
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 11/06/21 17:01, Programmingkid wrote:
>> Hello Alex,
>> The good news is the source code to Windows XP is available
>> online:https://github.com/cryptoAlgorithm/nt5src
>
> It's leaked, so I doubt anybody who's paid to work on Linux or QEMU
> would touch that with a ten-foot pole.

Indeed.

Anyway what the OP could do is run QEMU with gdb and -d nochain and
stick a breakpoint (sic) in breakpoint_invalidate. Then each time it
hits you can examine the backtrace to cpu_loop_exec_tb and collect the
data from tb->pc. Then you will have a bunch of addresses in Windows
that keep triggering the behaviour. You can then re-run with -dfilter
and -d in_asm,cpu to get some sort of idea of what Windows is up to.
Mark Cave-Ayland June 13, 2021, 2:03 p.m. UTC | #8
On 11/06/2021 19:22, Alex Bennée wrote:

(added Gitlab on CC)

> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> On 11/06/21 17:01, Programmingkid wrote:
>>> Hello Alex,
>>> The good news is the source code to Windows XP is available
>>> online:https://github.com/cryptoAlgorithm/nt5src
>>
>> It's leaked, so I doubt anybody who's paid to work on Linux or QEMU
>> would touch that with a ten-foot pole.
> 
> Indeed.
> 
> Anyway what the OP could do is run QEMU with gdb and -d nochain and
> stick a breakpoint (sic) in breakpoint_invalidate. Then each time it
> hits you can examine the backtrace to cpu_loop_exec_tb and collect the
> data from tb->pc. Then you will have a bunch of addresses in Windows
> that keep triggering the behaviour. You can then re-run with -dfilter
> and -d in_asm,cpu to get some sort of idea of what Windows is up to.

I have been able to recreate this locally using my WinXP and it looks like during 
boot WinXP goes into a tight loop where it writes and clears a set of breakpoints via 
writes to DB7 which is what causes the very slow boot time.

Once boot proceeds further into the login screen, the same code seems to called 
periodically once every second or so which has less of a performance impact.

For testing I applied the following diff:

diff --git a/cpu.c b/cpu.c
index 164fefeaa3..3155d935f1 100644
--- a/cpu.c
+++ b/cpu.c
@@ -252,13 +252,7 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr, 
MemTxAttrs attrs)

  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
  {
-    /*
-     * There may not be a virtual to physical translation for the pc
-     * right now, but there may exist cached TB for this pc.
-     * Flush the whole TB cache to force re-translation of such TBs.
-     * This is heavyweight, but we're debugging anyway.
-     */
-    tb_flush(cpu);
+    fprintf(stderr, "##### bpi @ 0x%lx\n", pc);
  }
  #endif

diff --git a/target/i386/tcg/sysemu/bpt_helper.c b/target/i386/tcg/sysemu/bpt_helper.c
index 9bdf7e170b..37ee49fd56 100644
--- a/target/i386/tcg/sysemu/bpt_helper.c
+++ b/target/i386/tcg/sysemu/bpt_helper.c
@@ -61,6 +61,7 @@ static int hw_breakpoint_insert(CPUX86State *env, int index)
      switch (hw_breakpoint_type(dr7, index)) {
      case DR7_TYPE_BP_INST:
          if (hw_breakpoint_enabled(dr7, index)) {
+            fprintf(stderr, "### dp7 add bp inst @ 0x%lx, index %d\n", env->eip, index);
              err = cpu_breakpoint_insert(cs, drN, BP_CPU,
                                          &env->cpu_breakpoint[index]);
          }
@@ -102,6 +103,7 @@ static void hw_breakpoint_remove(CPUX86State *env, int index)
      switch (hw_breakpoint_type(env->dr[7], index)) {
      case DR7_TYPE_BP_INST:
          if (env->cpu_breakpoint[index]) {
+            fprintf(stderr, "### dp7 remove bp inst @ 0x%lx, index %d\n", env->eip, 
index);
              cpu_breakpoint_remove_by_ref(cs, env->cpu_breakpoint[index]);
              env->cpu_breakpoint[index] = NULL;
          }


This gives a repeated set of outputs like this:

##### bpi @ 0x90
### dp7 add bp inst @ 0x8053cab8, index 1
##### bpi @ 0xa4
### dp7 add bp inst @ 0x8053cab8, index 2
##### bpi @ 0xff
### dp7 add bp inst @ 0x8053cab8, index 3
##### bpi @ 0xf
### dp7 remove bp inst @ 0x8053f58a, index 0
##### bpi @ 0x90
### dp7 remove bp inst @ 0x8053f58a, index 1
##### bpi @ 0xa4
### dp7 remove bp inst @ 0x8053f58a, index 2
##### bpi @ 0xff
### dp7 remove bp inst @ 0x8053f58a, index 3
...
...
### dp7 add bp inst @ 0x8053c960, index 0
##### bpi @ 0x90
### dp7 add bp inst @ 0x8053c960, index 1
##### bpi @ 0xa4
### dp7 add bp inst @ 0x8053c960, index 2
##### bpi @ 0xff
### dp7 add bp inst @ 0x8053c960, index 3
##### bpi @ 0xf
### dp7 remove bp inst @ 0x8053c730, index 0
##### bpi @ 0x90
### dp7 remove bp inst @ 0x8053c730, index 1
##### bpi @ 0xa4
### dp7 remove bp inst @ 0x8053c730, index 2
##### bpi @ 0xff
### dp7 remove bp inst @ 0x8053c730, index 3
...
...

 From a vanilla XP install the 2 main sections of code which alter the breakpoint 
registers are at 0x8053cab8 (enable) and 0x8053f58a (disable):


-d in_asm output when enabling breakpoints
==========================================

----------------
IN:
0x8053ca92:  33 db                    xorl     %ebx, %ebx
0x8053ca94:  8b 75 18                 movl     0x18(%ebp), %esi
0x8053ca97:  8b 7d 1c                 movl     0x1c(%ebp), %edi
0x8053ca9a:  0f 23 fb                 movl     %ebx, %dr7

----------------
IN:
0x8053ca9d:  0f 23 c6                 movl     %esi, %dr0

----------------
IN:
0x8053caa0:  8b 5d 20                 movl     0x20(%ebp), %ebx
0x8053caa3:  0f 23 cf                 movl     %edi, %dr1

----------------
IN:
0x8053caa6:  0f 23 d3                 movl     %ebx, %dr2

----------------
IN:
0x8053caa9:  8b 75 24                 movl     0x24(%ebp), %esi
0x8053caac:  8b 7d 28                 movl     0x28(%ebp), %edi
0x8053caaf:  8b 5d 2c                 movl     0x2c(%ebp), %ebx
0x8053cab2:  0f 23 de                 movl     %esi, %dr3

----------------
IN:
0x8053cab5:  0f 23 f7                 movl     %edi, %dr6

----------------
IN:
0x8053cab8:  0f 23 fb                 movl     %ebx, %dr7

### dp7 add bp inst @ 0x8053cab8, index 0
##### bpi @ 0x90
### dp7 add bp inst @ 0x8053cab8, index 1
##### bpi @ 0xa4
### dp7 add bp inst @ 0x8053cab8, index 2
##### bpi @ 0xff
### dp7 add bp inst @ 0x8053cab8, index 3
##### bpi @ 0xf
----------------
IN:
0x8053cabb:  e9 6f ff ff ff           jmp      0x8053ca2f


-d in_asm output when disabling breakpoints
===========================================

IN:
0x8053f58a:  0f 21 c3                 movl     %dr0, %ebx
0x8053f58d:  0f 21 c9                 movl     %dr1, %ecx
0x8053f590:  0f 21 d7                 movl     %dr2, %edi
0x8053f593:  89 5d 18                 movl     %ebx, 0x18(%ebp)
0x8053f596:  89 4d 1c                 movl     %ecx, 0x1c(%ebp)
0x8053f599:  89 7d 20                 movl     %edi, 0x20(%ebp)
0x8053f59c:  0f 21 db                 movl     %dr3, %ebx
0x8053f59f:  0f 21 f1                 movl     %dr6, %ecx
0x8053f5a2:  0f 21 ff                 movl     %dr7, %edi
0x8053f5a5:  89 5d 24                 movl     %ebx, 0x24(%ebp)
0x8053f5a8:  89 4d 28                 movl     %ecx, 0x28(%ebp)
0x8053f5ab:  33 db                    xorl     %ebx, %ebx
0x8053f5ad:  89 7d 2c                 movl     %edi, 0x2c(%ebp)
0x8053f5b0:  0f 23 fb                 movl     %ebx, %dr7

### dp7 remove bp inst @ 0x8053f58a, index 0
##### bpi @ 0x90
### dp7 remove bp inst @ 0x8053f58a, index 1
##### bpi @ 0xa4
### dp7 remove bp inst @ 0x8053f58a, index 2
##### bpi @ 0xff
### dp7 remove bp inst @ 0x8053f58a, index 3
##### bpi @ 0xf
----------------
IN:
0x8053f5b3:  64 8b 3d 20 00 00 00     movl     %fs:0x20, %edi
0x8053f5ba:  8b 9f f8 02 00 00        movl     0x2f8(%edi), %ebx
0x8053f5c0:  8b 8f fc 02 00 00        movl     0x2fc(%edi), %ecx
0x8053f5c6:  0f 23 c3                 movl     %ebx, %dr0

----------------
IN:
0x8053f5c9:  0f 23 c9                 movl     %ecx, %dr1

----------------
IN:
0x8053f5cc:  8b 9f 00 03 00 00        movl     0x300(%edi), %ebx
0x8053f5d2:  8b 8f 04 03 00 00        movl     0x304(%edi), %ecx
0x8053f5d8:  0f 23 d3                 movl     %ebx, %dr2

----------------
IN:
0x8053f5db:  0f 23 d9                 movl     %ecx, %dr3

----------------
IN:
0x8053f5de:  8b 9f 08 03 00 00        movl     0x308(%edi), %ebx
0x8053f5e4:  8b 8f 0c 03 00 00        movl     0x30c(%edi), %ecx
0x8053f5ea:  0f 23 f3                 movl     %ebx, %dr6

----------------
IN:
0x8053f5ed:  0f 23 f9                 movl     %ecx, %dr7

----------------
IN:
0x8053f5f0:  e9 8f 00 00 00           jmp      0x8053f684



ATB,

Mark.
Alex Bennée June 14, 2021, 2:37 p.m. UTC | #9
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 11/06/2021 19:22, Alex Bennée wrote:
>
> (added Gitlab on CC)
>
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>> 
>>> On 11/06/21 17:01, Programmingkid wrote:
>>>> Hello Alex,
>>>> The good news is the source code to Windows XP is available
>>>> online:https://github.com/cryptoAlgorithm/nt5src
>>>
>>> It's leaked, so I doubt anybody who's paid to work on Linux or QEMU
>>> would touch that with a ten-foot pole.
>> Indeed.
>> Anyway what the OP could do is run QEMU with gdb and -d nochain and
>> stick a breakpoint (sic) in breakpoint_invalidate. Then each time it
>> hits you can examine the backtrace to cpu_loop_exec_tb and collect the
>> data from tb->pc. Then you will have a bunch of addresses in Windows
>> that keep triggering the behaviour. You can then re-run with -dfilter
>> and -d in_asm,cpu to get some sort of idea of what Windows is up to.
>
> I have been able to recreate this locally using my WinXP and it looks
> like during boot WinXP goes into a tight loop where it writes and
> clears a set of breakpoints via writes to DB7 which is what causes the
> very slow boot time.
>
> Once boot proceeds further into the login screen, the same code seems
> to called periodically once every second or so which has less of a
> performance impact.
>
>
> This gives a repeated set of outputs like this:
>
> ##### bpi @ 0x90
> ### dp7 add bp inst @ 0x8053cab8, index 1
> ##### bpi @ 0xa4
> ### dp7 add bp inst @ 0x8053cab8, index 2
> ##### bpi @ 0xff
> ### dp7 add bp inst @ 0x8053cab8, index 3
> ##### bpi @ 0xf

That's weird - maybe this is a misunderstanding of the x86 debug
registers but it looks like it's setting each one to all the same value. 

> ### dp7 remove bp inst @ 0x8053f58a, index 0
> ##### bpi @ 0x90
> ### dp7 remove bp inst @ 0x8053f58a, index 1
> ##### bpi @ 0xa4
> ### dp7 remove bp inst @ 0x8053f58a, index 2
> ##### bpi @ 0xff
> ### dp7 remove bp inst @ 0x8053f58a, index 3
> ...
> ...
> ### dp7 add bp inst @ 0x8053c960, index 0
> ##### bpi @ 0x90
> ### dp7 add bp inst @ 0x8053c960, index 1
> ##### bpi @ 0xa4
> ### dp7 add bp inst @ 0x8053c960, index 2
> ##### bpi @ 0xff
> ### dp7 add bp inst @ 0x8053c960, index 3
> ##### bpi @ 0xf
> ### dp7 remove bp inst @ 0x8053c730, index 0
> ##### bpi @ 0x90
> ### dp7 remove bp inst @ 0x8053c730, index 1
> ##### bpi @ 0xa4
> ### dp7 remove bp inst @ 0x8053c730, index 2
> ##### bpi @ 0xff
> ### dp7 remove bp inst @ 0x8053c730, index 3
> ...
> ...

I wonder if this is Windows check pointing itself by observing when it
gets to a particular place in the boot sequence. I guess we don't have
any symbols for the addresses it's setting?

>
> From a vanilla XP install the 2 main sections of code which alter the
> breakpoint registers are at 0x8053cab8 (enable) and 0x8053f58a
> (disable):

Ahh I misread - so those are the addresses of the routines and not where
it's sticking the breakpoint?

I notice from a bit of googling that there is a boot debugger. I wonder
if /nodebug in boot.ini stops this behaviour?

  https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
no-reply@patchew.org June 14, 2021, 10:19 p.m. UTC | #10
Patchew URL: https://patchew.org/QEMU/BCB8773B-FC54-4C25-9B60-92C263165D38@gmail.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: BCB8773B-FC54-4C25-9B60-92C263165D38@gmail.com
Subject: tb_flush() calls causing long Windows XP boot times

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20210610213906.1313440-1-eblake@redhat.com -> patchew/20210610213906.1313440-1-eblake@redhat.com
 * [new tag]         patchew/20210614083800.1166166-1-richard.henderson@linaro.org -> patchew/20210614083800.1166166-1-richard.henderson@linaro.org
Switched to a new branch 'test'
0c48784 tb_flush() calls causing long Windows XP boot times

=== OUTPUT BEGIN ===
ERROR: Missing Signed-off-by: line(s)

total: 1 errors, 0 warnings, 8 lines checked

Commit 0c48784df7c4 (tb_flush() calls causing long Windows XP boot times) has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/BCB8773B-FC54-4C25-9B60-92C263165D38@gmail.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Programmingkid June 15, 2021, 1:58 p.m. UTC | #11
> On Jun 14, 2021, at 10:37 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:
> 
>> On 11/06/2021 19:22, Alex Bennée wrote:
>> 
>> (added Gitlab on CC)
>> 
>>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>> 
>>>> On 11/06/21 17:01, Programmingkid wrote:
>>>>> Hello Alex,
>>>>> The good news is the source code to Windows XP is available
>>>>> online:https://github.com/cryptoAlgorithm/nt5src
>>>> 
>>>> It's leaked, so I doubt anybody who's paid to work on Linux or QEMU
>>>> would touch that with a ten-foot pole.
>>> Indeed.
>>> Anyway what the OP could do is run QEMU with gdb and -d nochain and
>>> stick a breakpoint (sic) in breakpoint_invalidate. Then each time it
>>> hits you can examine the backtrace to cpu_loop_exec_tb and collect the
>>> data from tb->pc. Then you will have a bunch of addresses in Windows
>>> that keep triggering the behaviour. You can then re-run with -dfilter
>>> and -d in_asm,cpu to get some sort of idea of what Windows is up to.
>> 
>> I have been able to recreate this locally using my WinXP and it looks
>> like during boot WinXP goes into a tight loop where it writes and
>> clears a set of breakpoints via writes to DB7 which is what causes the
>> very slow boot time.
>> 
>> Once boot proceeds further into the login screen, the same code seems
>> to called periodically once every second or so which has less of a
>> performance impact.
>> 
>> 
>> This gives a repeated set of outputs like this:
>> 
>> ##### bpi @ 0x90
>> ### dp7 add bp inst @ 0x8053cab8, index 1
>> ##### bpi @ 0xa4
>> ### dp7 add bp inst @ 0x8053cab8, index 2
>> ##### bpi @ 0xff
>> ### dp7 add bp inst @ 0x8053cab8, index 3
>> ##### bpi @ 0xf
> 
> That's weird - maybe this is a misunderstanding of the x86 debug
> registers but it looks like it's setting each one to all the same value. 
> 
>> ### dp7 remove bp inst @ 0x8053f58a, index 0
>> ##### bpi @ 0x90
>> ### dp7 remove bp inst @ 0x8053f58a, index 1
>> ##### bpi @ 0xa4
>> ### dp7 remove bp inst @ 0x8053f58a, index 2
>> ##### bpi @ 0xff
>> ### dp7 remove bp inst @ 0x8053f58a, index 3
>> ...
>> ...
>> ### dp7 add bp inst @ 0x8053c960, index 0
>> ##### bpi @ 0x90
>> ### dp7 add bp inst @ 0x8053c960, index 1
>> ##### bpi @ 0xa4
>> ### dp7 add bp inst @ 0x8053c960, index 2
>> ##### bpi @ 0xff
>> ### dp7 add bp inst @ 0x8053c960, index 3
>> ##### bpi @ 0xf
>> ### dp7 remove bp inst @ 0x8053c730, index 0
>> ##### bpi @ 0x90
>> ### dp7 remove bp inst @ 0x8053c730, index 1
>> ##### bpi @ 0xa4
>> ### dp7 remove bp inst @ 0x8053c730, index 2
>> ##### bpi @ 0xff
>> ### dp7 remove bp inst @ 0x8053c730, index 3
>> ...
>> ...
> 
> I wonder if this is Windows check pointing itself by observing when it
> gets to a particular place in the boot sequence. I guess we don't have
> any symbols for the addresses it's setting?
> 
>> 
>> From a vanilla XP install the 2 main sections of code which alter the
>> breakpoint registers are at 0x8053cab8 (enable) and 0x8053f58a
>> (disable):
> 
> Ahh I misread - so those are the addresses of the routines and not where
> it's sticking the breakpoint?
> 
> I notice from a bit of googling that there is a boot debugger. I wonder
> if /nodebug in boot.ini stops this behaviour?
> 
>  https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
> 
> -- 
> Alex Bennée

Hi Alex, 

I tried your suggestion of using /nodebug. It did not stop the tb_flush() function from being called.
Richard Henderson June 16, 2021, 1:58 a.m. UTC | #12
On 6/15/21 6:58 AM, Programmingkid wrote:
>> Ahh I misread - so those are the addresses of the routines and not where
>> it's sticking the breakpoint?
>>
>> I notice from a bit of googling that there is a boot debugger. I wonder
>> if /nodebug in boot.ini stops this behaviour?
>>
>>   https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
>>
>> -- 
>> Alex Bennée
> 
> Hi Alex,
> 
> I tried your suggestion of using /nodebug. It did not stop the tb_flush() function from being called.

We are not expecting zero calls to tb_flush (it is used for other things, including buffer 
full), but we are hoping that it reduces the frequency of the calls.

I'm guessing you didn't immediately see the slowdown vanish, and so there was no change to 
the frequency of the calls.

FWIW, if you switch to the qemu console, you can see how many flushes have occurred with 
"info jit".


r~
Mark Cave-Ayland June 16, 2021, 8:59 a.m. UTC | #13
On 16/06/2021 02:58, Richard Henderson wrote:

> On 6/15/21 6:58 AM, Programmingkid wrote:
>>> Ahh I misread - so those are the addresses of the routines and not where
>>> it's sticking the breakpoint?
>>>
>>> I notice from a bit of googling that there is a boot debugger. I wonder
>>> if /nodebug in boot.ini stops this behaviour?
>>>
>>>   
>>> https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files 
>>>
>>>
>>> -- 
>>> Alex Bennée
>>
>> Hi Alex,
>>
>> I tried your suggestion of using /nodebug. It did not stop the tb_flush() function 
>> from being called.
> 
> We are not expecting zero calls to tb_flush (it is used for other things, including 
> buffer full), but we are hoping that it reduces the frequency of the calls.
> 
> I'm guessing you didn't immediately see the slowdown vanish, and so there was no 
> change to the frequency of the calls.
> 
> FWIW, if you switch to the qemu console, you can see how many flushes have occurred 
> with "info jit".

Looking at the diff of b55f54bc96 which first introduced the regression, presumably 
the difference is now that everything is being flushed rather than a specific address 
space when WinXP twiddles with the DB7 register:


diff --git a/exec.c b/exec.c
index 67e520d18e..7f4074f95e 100644
--- a/exec.c
+++ b/exec.c
@@ -1019,14 +1019,13 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr, 
MemTxAttrs attrs)

  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
  {
-    MemTxAttrs attrs;
-    hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
-    int asidx = cpu_asidx_from_attrs(cpu, attrs);
-    if (phys != -1) {
-        /* Locks grabbed by tb_invalidate_phys_addr */
-        tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
-                                phys | (pc & ~TARGET_PAGE_MASK), attrs);
-    }
+    /*
+     * There may not be a virtual to physical translation for the pc
+     * right now, but there may exist cached TB for this pc.
+     * Flush the whole TB cache to force re-translation of such TBs.
+     * This is heavyweight, but we're debugging anyway.
+     */
+    tb_flush(cpu);
  }
  #endif


Unfortunately my x86-fu isn't really enough to understand what the solution should be 
in this case.


ATB,

Mark.
Programmingkid June 16, 2021, 12:12 p.m. UTC | #14
> On Jun 15, 2021, at 9:58 PM, Richard Henderson <richard.henderson@linaro.org> wrote:
> 
> On 6/15/21 6:58 AM, Programmingkid wrote:
>>> Ahh I misread - so those are the addresses of the routines and not where
>>> it's sticking the breakpoint?
>>> 
>>> I notice from a bit of googling that there is a boot debugger. I wonder
>>> if /nodebug in boot.ini stops this behaviour?
>>> 
>>>  https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
>>> 
>>> -- 
>>> Alex Bennée
>> Hi Alex,
>> I tried your suggestion of using /nodebug. It did not stop the tb_flush() function from being called.
> 
> We are not expecting zero calls to tb_flush (it is used for other things, including buffer full), but we are hoping that it reduces the frequency of the calls.

Agreed.

> I'm guessing you didn't immediately see the slowdown vanish, and so there was no change to the frequency of the calls.

Correct.

> FWIW, if you switch to the qemu console, you can see how many flushes have occurred with "info jit".

Thank you very much for this information.

I'm currently learning about the x86's debug registers D0 to D7. There are a lot of rules associated with them. So my guess is one or more rules may not be implemented in QEMU. I will try to test them out in FreeDOS and compare notes with a real x86 CPU.

A possible workaround might be to implement a command line option that allows the user to specify how often the tb_flush() call is made. When I eliminated the call I could not find any problems with my VM's. I understand if this is not possible.
Alex Bennée June 16, 2021, 12:53 p.m. UTC | #15
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 16/06/2021 02:58, Richard Henderson wrote:
>
>> On 6/15/21 6:58 AM, Programmingkid wrote:
>>>> Ahh I misread - so those are the addresses of the routines and not where
>>>> it's sticking the breakpoint?
>>>>
>>>> I notice from a bit of googling that there is a boot debugger. I wonder
>>>> if /nodebug in boot.ini stops this behaviour?
>>>>
>>>>   https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
>>>> -- Alex Bennée
>>>
>>> Hi Alex,
>>>
>>> I tried your suggestion of using /nodebug. It did not stop the
>>> tb_flush() function from being called.
>> We are not expecting zero calls to tb_flush (it is used for other
>> things, including buffer full), but we are hoping that it reduces
>> the frequency of the calls.
>> I'm guessing you didn't immediately see the slowdown vanish, and so
>> there was no change to the frequency of the calls.
>> FWIW, if you switch to the qemu console, you can see how many
>> flushes have occurred with "info jit".
>
> Looking at the diff of b55f54bc96 which first introduced the
> regression, presumably the difference is now that everything is being
> flushed rather than a specific address space when WinXP twiddles with
> the DB7 register:
>
>
> diff --git a/exec.c b/exec.c
> index 67e520d18e..7f4074f95e 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1019,14 +1019,13 @@ void tb_invalidate_phys_addr(AddressSpace *as,
> hwaddr addr, MemTxAttrs attrs)
>
>  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>  {
> -    MemTxAttrs attrs;
> -    hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
> -    int asidx = cpu_asidx_from_attrs(cpu, attrs);
> -    if (phys != -1) {
> -        /* Locks grabbed by tb_invalidate_phys_addr */
> -        tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
> -                                phys | (pc & ~TARGET_PAGE_MASK), attrs);
> -    }
> +    /*
> +     * There may not be a virtual to physical translation for the pc
> +     * right now, but there may exist cached TB for this pc.
> +     * Flush the whole TB cache to force re-translation of such TBs.
> +     * This is heavyweight, but we're debugging anyway.
> +     */
> +    tb_flush(cpu);
>  }
>  #endif
>
>
> Unfortunately my x86-fu isn't really enough to understand what the
> solution should be in this case.

It's not really an x86 issue here but that we don't have any easy way of
finding the subset of TranslationBlock's that might be affected. We can
only query the QHT for a head address + flags. Meanwhile when there is
an active mapping we go through the page tables 


>
>
> ATB,
>
> Mark.
>
>
Peter Maydell June 16, 2021, 1:06 p.m. UTC | #16
On Wed, 16 Jun 2021 at 13:53, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:
> > diff --git a/exec.c b/exec.c
> > index 67e520d18e..7f4074f95e 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -1019,14 +1019,13 @@ void tb_invalidate_phys_addr(AddressSpace *as,
> > hwaddr addr, MemTxAttrs attrs)
> >
> >  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
> >  {
> > -    MemTxAttrs attrs;
> > -    hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
> > -    int asidx = cpu_asidx_from_attrs(cpu, attrs);
> > -    if (phys != -1) {
> > -        /* Locks grabbed by tb_invalidate_phys_addr */
> > -        tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
> > -                                phys | (pc & ~TARGET_PAGE_MASK), attrs);
> > -    }
> > +    /*
> > +     * There may not be a virtual to physical translation for the pc
> > +     * right now, but there may exist cached TB for this pc.
> > +     * Flush the whole TB cache to force re-translation of such TBs.
> > +     * This is heavyweight, but we're debugging anyway.
> > +     */
> > +    tb_flush(cpu);
> >  }
> >  #endif
> >
> >
> > Unfortunately my x86-fu isn't really enough to understand what the
> > solution should be in this case.
>
> It's not really an x86 issue here but that we don't have any easy way of
> finding the subset of TranslationBlock's that might be affected. We can
> only query the QHT for a head address + flags. Meanwhile when there is
> an active mapping we go through the page tables

Could we do something where we zap the TBs here where there is an active
virtual-to-physical mapping for this PC, and also make a record of affected
PCs (or PC ranges) so that before we add a new entry to the
virtual-to-physical mapping we check the record to see if we actually need
to flush this TB? I think if you flush all the TLBs at this point then
you can do the "check before adding new entry" part in
tlb_set_page_with_attrs(),
but I'm not super familiar with the execution flow of TCG so that might be
wrong. Also there needs to be a point where we can discard entries from
our "dump this TB for this PC" records so they don't just grow indefinitely,
and I'm not sure what that would be.

thanks
-- PMM
Alex Bennée June 16, 2021, 1:21 p.m. UTC | #17
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 16/06/2021 02:58, Richard Henderson wrote:
>
>> On 6/15/21 6:58 AM, Programmingkid wrote:
>>>> Ahh I misread - so those are the addresses of the routines and not where
>>>> it's sticking the breakpoint?
>>>>
>>>> I notice from a bit of googling that there is a boot debugger. I wonder
>>>> if /nodebug in boot.ini stops this behaviour?
>>>>
>>>>   https://docs.microsoft.com/en-us/troubleshoot/windows-server/performance/switch-options-for-boot-files
>>>> -- Alex Bennée
>>>
>>> Hi Alex,
>>>
>>> I tried your suggestion of using /nodebug. It did not stop the
>>> tb_flush() function from being called.
>> We are not expecting zero calls to tb_flush (it is used for other
>> things, including buffer full), but we are hoping that it reduces
>> the frequency of the calls.
>> I'm guessing you didn't immediately see the slowdown vanish, and so
>> there was no change to the frequency of the calls.
>> FWIW, if you switch to the qemu console, you can see how many
>> flushes have occurred with "info jit".
>
> Looking at the diff of b55f54bc96 which first introduced the
> regression, presumably the difference is now that everything is being
> flushed rather than a specific address space when WinXP twiddles with
> the DB7 register:
>
>
> diff --git a/exec.c b/exec.c
> index 67e520d18e..7f4074f95e 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1019,14 +1019,13 @@ void tb_invalidate_phys_addr(AddressSpace *as,
> hwaddr addr, MemTxAttrs attrs)
>
>  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>  {
> -    MemTxAttrs attrs;
> -    hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
> -    int asidx = cpu_asidx_from_attrs(cpu, attrs);
> -    if (phys != -1) {
> -        /* Locks grabbed by tb_invalidate_phys_addr */
> -        tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
> -                                phys | (pc & ~TARGET_PAGE_MASK), attrs);
> -    }
> +    /*
> +     * There may not be a virtual to physical translation for the pc
> +     * right now, but there may exist cached TB for this pc.
> +     * Flush the whole TB cache to force re-translation of such TBs.
> +     * This is heavyweight, but we're debugging anyway.
> +     */
> +    tb_flush(cpu);
>  }
>  #endif
>
>
> Unfortunately my x86-fu isn't really enough to understand what the
> solution should be in this case.

It's not really an x86 issue here but more of an internal accounting
issue of how we find TranslationBlocks when we need to tweak them.

For general purpose running we can query the QHT for a initial PC +
flags. This is fine for finding the next block to run but doesn't help
in this case because the breakpoint could be in any sub-address of the
block. So to be sure we just nuke all the blocks in the page that is
affected.

The problem is our route to finding the list of blocks is through the
page tables. However the problem exists when there isn't an currently
active map for a virtual to physical address (which is what happens in
the reported race condition, where the kernel may temporarily page out a
user page).

We could just iterate through all the TB's but there are a lot and that
will take some time. The other option would be somehow marking the
ultimate page entry with some sort of generation count which we could
check when doing a tb_lookup and invalidating the TB then and forcing a
new generation. I need to have a better understanding of the lifecycle
of the pages in the page cache code to see how we could approach this.
Alex Bennée June 16, 2021, 3:30 p.m. UTC | #18
Peter Maydell <peter.maydell@linaro.org> writes:

> On Wed, 16 Jun 2021 at 13:53, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:
>> > diff --git a/exec.c b/exec.c
>> > index 67e520d18e..7f4074f95e 100644
>> > --- a/exec.c
>> > +++ b/exec.c
>> > @@ -1019,14 +1019,13 @@ void tb_invalidate_phys_addr(AddressSpace *as,
>> > hwaddr addr, MemTxAttrs attrs)
>> >
>> >  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>> >  {
>> > -    MemTxAttrs attrs;
>> > -    hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
>> > -    int asidx = cpu_asidx_from_attrs(cpu, attrs);
>> > -    if (phys != -1) {
>> > -        /* Locks grabbed by tb_invalidate_phys_addr */
>> > -        tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
>> > -                                phys | (pc & ~TARGET_PAGE_MASK), attrs);
>> > -    }
>> > +    /*
>> > +     * There may not be a virtual to physical translation for the pc
>> > +     * right now, but there may exist cached TB for this pc.
>> > +     * Flush the whole TB cache to force re-translation of such TBs.
>> > +     * This is heavyweight, but we're debugging anyway.
>> > +     */
>> > +    tb_flush(cpu);
>> >  }
>> >  #endif
>> >
>> >
>> > Unfortunately my x86-fu isn't really enough to understand what the
>> > solution should be in this case.
>>
>> It's not really an x86 issue here but that we don't have any easy way of
>> finding the subset of TranslationBlock's that might be affected. We can
>> only query the QHT for a head address + flags. Meanwhile when there is
>> an active mapping we go through the page tables
>
> Could we do something where we zap the TBs here where there is an active
> virtual-to-physical mapping for this PC, and also make a record of affected
> PCs (or PC ranges) so that before we add a new entry to the
> virtual-to-physical mapping we check the record to see if we actually need
> to flush this TB? I think if you flush all the TLBs at this point then
> you can do the "check before adding new entry" part in
> tlb_set_page_with_attrs(),
> but I'm not super familiar with the execution flow of TCG so that might be
> wrong.

So in breakpoint_invalidate can we actually probe for the existence of
an active mapping for a given virt<->phys entry? If there is we call the
tb_invalidate_phys_addr as before, if not save the data which we check
when updating the softmmu tlb.

> Also there needs to be a point where we can discard entries from
> our "dump this TB for this PC" records so they don't just grow indefinitely,
> and I'm not sure what that would be.

I wondered if there was a way to use a bloom filter for this? But there
doesn't seem to be an easy way of removing entries once you've done the
thing you wanted to do. I guess we could just reset when all breakpoints
are cleared or we do a tb_flush() for other reasons.
diff mbox series

Patch

diff --git a/cpu.c b/cpu.c
index bfbe5a66f9..297c2e4281 100644
--- a/cpu.c
+++ b/cpu.c
@@ -253,7 +253,7 @@  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
      * Flush the whole TB cache to force re-translation of such TBs.
      * This is heavyweight, but we're debugging anyway.
      */
-    tb_flush(cpu);
+    /* tb_flush(cpu); */
 }
 #endif