diff mbox series

[kvm-unit-tests,1/2] x86: access: Fix timeout failure by limiting number of flag combinations

Message ID 162826611747.32391.16149996928851353357.stgit@bmoger-ubuntu (mailing list archive)
State New, archived
Headers show
Series Couple of SVM fixes | expand

Commit Message

Babu Moger Aug. 6, 2021, 4:08 p.m. UTC
From: Babu Moger <Babu.Moger@amd.com>

The test ./x86/access fails with a timeout. This is due to the number test
combination. The test cases increase exponentially as the features get
enabled. The new machine adds the feature AC_CPU_CR4_PKE. The default
timeout is 180 seconds. Seen this problem both on AMD and Intel machines.

#./tests/access
qemu-system-x86_64: terminating on signal 15 from pid 20050 (timeout)
FAIL access (timeout; duration=180)

This test can take about 7 minutes without timeout.
time ./tests/access
58982405 tests, 0 failures
PASS access

real	7m10.063s
user	7m9.063s
sys	0m0.309s

Fix the problem by adding few more limit checks.

Signed-off-by: Babu Moger <Babu.Moger@amd.com>
---
 x86/access.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Babu Moger Aug. 9, 2021, 7:43 p.m. UTC | #1
On 8/6/21 11:53 AM, Sean Christopherson wrote:
> On Fri, Aug 06, 2021, Babu Moger wrote:
>> From: Babu Moger <Babu.Moger@amd.com>
>>
>> The test ./x86/access fails with a timeout. This is due to the number test
>> combination. The test cases increase exponentially as the features get
>> enabled. The new machine adds the feature AC_CPU_CR4_PKE. The default
>> timeout is 180 seconds. Seen this problem both on AMD and Intel machines.
>>
>> #./tests/access
>> qemu-system-x86_64: terminating on signal 15 from pid 20050 (timeout)
>> FAIL access (timeout; duration=180)
>>
>> This test can take about 7 minutes without timeout.
>> time ./tests/access
>> 58982405 tests, 0 failures
>> PASS access
>>
>> real	7m10.063s
>> user	7m9.063s
>> sys	0m0.309s
>>
>> Fix the problem by adding few more limit checks.
> 
> Please state somewhere in the changelog what is actually being changed, and the
> actual effect of the change.  E.g.
> 
>   Disallow protection keys testcase in combination with reserved bit
>   testcasess to further limit the number of tests run to avoid timeouts on
>   systems with support for protection keys.
> 
>   Disallowing this combination reduces the total number of tests from
>   58982405 to <???>, and the runtime from ~7 minutes to <???>

Sure. Will do.
> 
>> Signed-off-by: Babu Moger <Babu.Moger@amd.com>
>> ---
>>  x86/access.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/x86/access.c b/x86/access.c
>> index 47807cc..e371dd5 100644
>> --- a/x86/access.c
>> +++ b/x86/access.c
>> @@ -317,9 +317,9 @@ static _Bool ac_test_legal(ac_test_t *at)
>>      /*
>>       * Shorten the test by avoiding testing too many reserved bit combinations
>>       */
>> -    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
>> +    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) + F(AC_CPU_CR4_PKE)) > 1)
>>          return false;
>> -    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>> +    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36) + F(AC_CPU_CR4_PKE)) > 1)
> 
> Why are protection keys the sacrifical lamb?  Simply because they're the newest?

Yes. I added it because it was new ):.
> 
> And before we start killing off arbitrary combinations, what about sharding the
> test so that testcases that depend on a specific CR0/CR4/EFER bit, i.e. trigger
> a VM-Exit when the configuration changes, are separate runs?  Being able to run
> a specific combination would also hopefully make it easier to debug issues as
> the user could specify which combo to run without having to modify the code and
> recompile.
> 
> That probably won't actually reduce the total run time, but it would make each
> run a separate test and give developers a warm fuzzy feeling that they're indeed
> making progress :-)
> 
> Not sure how this could be automagically expressed this in unittest.cfg though...

Let me investigate if we can do that fairly easy. Will let you know.
Thanks
Babu
> 
>>          return false;
>>  
>>      return true;
>>
Babu Moger Aug. 10, 2021, 4:59 p.m. UTC | #2
On 8/9/21 2:43 PM, Babu Moger wrote:
> 
> 
> On 8/6/21 11:53 AM, Sean Christopherson wrote:
>> On Fri, Aug 06, 2021, Babu Moger wrote:
>>> From: Babu Moger <Babu.Moger@amd.com>
>>>
>>> The test ./x86/access fails with a timeout. This is due to the number test
>>> combination. The test cases increase exponentially as the features get
>>> enabled. The new machine adds the feature AC_CPU_CR4_PKE. The default
>>> timeout is 180 seconds. Seen this problem both on AMD and Intel machines.
>>>
>>> #./tests/access
>>> qemu-system-x86_64: terminating on signal 15 from pid 20050 (timeout)
>>> FAIL access (timeout; duration=180)
>>>
>>> This test can take about 7 minutes without timeout.
>>> time ./tests/access
>>> 58982405 tests, 0 failures
>>> PASS access
>>>
>>> real	7m10.063s
>>> user	7m9.063s
>>> sys	0m0.309s
>>>
>>> Fix the problem by adding few more limit checks.
>>
>> Please state somewhere in the changelog what is actually being changed, and the
>> actual effect of the change.  E.g.
>>
>>   Disallow protection keys testcase in combination with reserved bit
>>   testcasess to further limit the number of tests run to avoid timeouts on
>>   systems with support for protection keys.
>>
>>   Disallowing this combination reduces the total number of tests from
>>   58982405 to <???>, and the runtime from ~7 minutes to <???>
> 
> Sure. Will do.
>>
>>> Signed-off-by: Babu Moger <Babu.Moger@amd.com>
>>> ---
>>>  x86/access.c |    4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/x86/access.c b/x86/access.c
>>> index 47807cc..e371dd5 100644
>>> --- a/x86/access.c
>>> +++ b/x86/access.c
>>> @@ -317,9 +317,9 @@ static _Bool ac_test_legal(ac_test_t *at)
>>>      /*
>>>       * Shorten the test by avoiding testing too many reserved bit combinations
>>>       */
>>> -    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
>>> +    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) + F(AC_CPU_CR4_PKE)) > 1)
>>>          return false;
>>> -    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>>> +    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36) + F(AC_CPU_CR4_PKE)) > 1)
>>
>> Why are protection keys the sacrifical lamb?  Simply because they're the newest?
> 
> Yes. I added it because it was new ):.
>>
>> And before we start killing off arbitrary combinations, what about sharding the
>> test so that testcases that depend on a specific CR0/CR4/EFER bit, i.e. trigger
>> a VM-Exit when the configuration changes, are separate runs?  Being able to run
>> a specific combination would also hopefully make it easier to debug issues as
>> the user could specify which combo to run without having to modify the code and
>> recompile.
>>
>> That probably won't actually reduce the total run time, but it would make each
>> run a separate test and give developers a warm fuzzy feeling that they're indeed
>> making progress :-)
>>
>> Not sure how this could be automagically expressed this in unittest.cfg though...

As we know now that we cannot run a huge number of tests without running
into timeout, I was thinking of adding a extra parameter "max_runs" for
these tests and add a check in ac_test_run to limit the number of runs.
The max_runs will be set to default 10000000. But it can be changed in
unittests.cfg. Something like this.

[access]
 file = access.flat
 arch = x86_64
-extra_params = -cpu max
+extra_params = -cpu max -append 10000000
 timeout = 180

Thoughts?

> 
> Let me investigate if we can do that fairly easy. Will let you know.
> Thanks
> Babu
>>
>>>          return false;
>>>  
>>>      return true;
>>>
Babu Moger Aug. 10, 2021, 11:38 p.m. UTC | #3
On 8/10/21 11:59 AM, Babu Moger wrote:
> 
> 
> On 8/9/21 2:43 PM, Babu Moger wrote:
>>
>>
>> On 8/6/21 11:53 AM, Sean Christopherson wrote:
>>> On Fri, Aug 06, 2021, Babu Moger wrote:
>>>> From: Babu Moger <Babu.Moger@amd.com>
>>>>
>>>> The test ./x86/access fails with a timeout. This is due to the number test
>>>> combination. The test cases increase exponentially as the features get
>>>> enabled. The new machine adds the feature AC_CPU_CR4_PKE. The default
>>>> timeout is 180 seconds. Seen this problem both on AMD and Intel machines.
>>>>
>>>> #./tests/access
>>>> qemu-system-x86_64: terminating on signal 15 from pid 20050 (timeout)
>>>> FAIL access (timeout; duration=180)
>>>>
>>>> This test can take about 7 minutes without timeout.
>>>> time ./tests/access
>>>> 58982405 tests, 0 failures
>>>> PASS access
>>>>
>>>> real	7m10.063s
>>>> user	7m9.063s
>>>> sys	0m0.309s
>>>>
>>>> Fix the problem by adding few more limit checks.
>>>
>>> Please state somewhere in the changelog what is actually being changed, and the
>>> actual effect of the change.  E.g.
>>>
>>>   Disallow protection keys testcase in combination with reserved bit
>>>   testcasess to further limit the number of tests run to avoid timeouts on
>>>   systems with support for protection keys.
>>>
>>>   Disallowing this combination reduces the total number of tests from
>>>   58982405 to <???>, and the runtime from ~7 minutes to <???>
>>
>> Sure. Will do.
>>>
>>>> Signed-off-by: Babu Moger <Babu.Moger@amd.com>
>>>> ---
>>>>  x86/access.c |    4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/x86/access.c b/x86/access.c
>>>> index 47807cc..e371dd5 100644
>>>> --- a/x86/access.c
>>>> +++ b/x86/access.c
>>>> @@ -317,9 +317,9 @@ static _Bool ac_test_legal(ac_test_t *at)
>>>>      /*
>>>>       * Shorten the test by avoiding testing too many reserved bit combinations
>>>>       */
>>>> -    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
>>>> +    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) + F(AC_CPU_CR4_PKE)) > 1)
>>>>          return false;
>>>> -    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>>>> +    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36) + F(AC_CPU_CR4_PKE)) > 1)
>>>
>>> Why are protection keys the sacrifical lamb?  Simply because they're the newest?
>>
>> Yes. I added it because it was new ):.
>>>
>>> And before we start killing off arbitrary combinations, what about sharding the
>>> test so that testcases that depend on a specific CR0/CR4/EFER bit, i.e. trigger
>>> a VM-Exit when the configuration changes, are separate runs?  Being able to run
>>> a specific combination would also hopefully make it easier to debug issues as
>>> the user could specify which combo to run without having to modify the code and
>>> recompile.
>>>
>>> That probably won't actually reduce the total run time, but it would make each
>>> run a separate test and give developers a warm fuzzy feeling that they're indeed
>>> making progress :-)
>>>
>>> Not sure how this could be automagically expressed this in unittest.cfg though...
> 
> As we know now that we cannot run a huge number of tests without running
> into timeout, I was thinking of adding a extra parameter "max_runs" for
> these tests and add a check in ac_test_run to limit the number of runs.
> The max_runs will be set to default 10000000. But it can be changed in
> unittests.cfg. Something like this.
> 
> [access]
>  file = access.flat
>  arch = x86_64
> -extra_params = -cpu max
> +extra_params = -cpu max -append 10000000

No. This will not work. The PKU feature flag is bit 30. That is 2^30
iterations to cover the tests for this feature. Looks like I need to split
the tests into PKU and non PKU tests. For PKU tests I may need to change
the bump frequency (in ac_test_bump_one) to much higher value. Right now,
it is 1. Let me try that,

>  timeout = 180
> 
> Thoughts?
> 
>>
>> Let me investigate if we can do that fairly easy. Will let you know.
>> Thanks
>> Babu
>>>
>>>>          return false;
>>>>  
>>>>      return true;
>>>>
Paolo Bonzini Aug. 11, 2021, 7:09 a.m. UTC | #4
On 11/08/21 01:38, Babu Moger wrote:
> No. This will not work. The PKU feature flag is bit 30. That is 2^30
> iterations to cover the tests for this feature. Looks like I need to split
> the tests into PKU and non PKU tests. For PKU tests I may need to change
> the bump frequency (in ac_test_bump_one) to much higher value. Right now,
> it is 1. Let me try that,

The simplest way to cut on tests, which is actually similar to this 
patch, would be:

- do not try all combinations of PTE access bits when reserved bits are set

- do not try combinations with more than one reserved bit set

Paolo
Babu Moger Aug. 11, 2021, 4:03 p.m. UTC | #5
On 8/11/21 2:09 AM, Paolo Bonzini wrote:
> On 11/08/21 01:38, Babu Moger wrote:
>> No. This will not work. The PKU feature flag is bit 30. That is 2^30
>> iterations to cover the tests for this feature. Looks like I need to split
>> the tests into PKU and non PKU tests. For PKU tests I may need to change
>> the bump frequency (in ac_test_bump_one) to much higher value. Right now,
>> it is 1. Let me try that,
> 
> The simplest way to cut on tests, which is actually similar to this patch,
> would be:
> 
> - do not try all combinations of PTE access bits when reserved bits are set
> 
> - do not try combinations with more than one reserved bit set

Did you mean this? Just doing this reduces the combination by huge number.
I don't need to add your first PTE access combinations.

diff --git a/x86/access.c b/x86/access.c
index 47807cc..a730b6b 100644
--- a/x86/access.c
+++ b/x86/access.c
@@ -317,9 +317,7 @@ static _Bool ac_test_legal(ac_test_t *at)
     /*
      * Shorten the test by avoiding testing too many reserved bit
combinations
      */
-    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
-        return false;
-    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
+    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) +
F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
         return false;

     return true;
Sean Christopherson Aug. 11, 2021, 4:13 p.m. UTC | #6
On Wed, Aug 11, 2021, Babu Moger wrote:
> 
> On 8/11/21 2:09 AM, Paolo Bonzini wrote:
> > On 11/08/21 01:38, Babu Moger wrote:
> >> No. This will not work. The PKU feature flag is bit 30. That is 2^30
> >> iterations to cover the tests for this feature. Looks like I need to split
> >> the tests into PKU and non PKU tests. For PKU tests I may need to change
> >> the bump frequency (in ac_test_bump_one) to much higher value. Right now,
> >> it is 1. Let me try that,
> > 
> > The simplest way to cut on tests, which is actually similar to this patch,
> > would be:
> > 
> > - do not try all combinations of PTE access bits when reserved bits are set
> > 
> > - do not try combinations with more than one reserved bit set
> 
> Did you mean this? Just doing this reduces the combination by huge number.
> I don't need to add your first PTE access combinations.
> 
> diff --git a/x86/access.c b/x86/access.c
> index 47807cc..a730b6b 100644
> --- a/x86/access.c
> +++ b/x86/access.c
> @@ -317,9 +317,7 @@ static _Bool ac_test_legal(ac_test_t *at)
>      /*
>       * Shorten the test by avoiding testing too many reserved bit
> combinations
>       */
> -    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
> -        return false;
> -    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
> +    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) +
> F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>          return false;
> 
>      return true;

Looks good to me, is it sufficient to keep the overall runtime sane?.  And maybe
update the comment too, e.g. something like

	/*
	 * Skip testing multiple reserved bits to shorten the test.  Reserved
	 * bit page faults are terminal and multiple reserved bits do not affect
	 * the error code; the odds of a KVM bug are super low, and the odds of
	 * actually being able to detect a bug are even lower.
	 */
Babu Moger Aug. 11, 2021, 4:43 p.m. UTC | #7
On 8/11/21 11:13 AM, Sean Christopherson wrote:
> On Wed, Aug 11, 2021, Babu Moger wrote:
>>
>> On 8/11/21 2:09 AM, Paolo Bonzini wrote:
>>> On 11/08/21 01:38, Babu Moger wrote:
>>>> No. This will not work. The PKU feature flag is bit 30. That is 2^30
>>>> iterations to cover the tests for this feature. Looks like I need to split
>>>> the tests into PKU and non PKU tests. For PKU tests I may need to change
>>>> the bump frequency (in ac_test_bump_one) to much higher value. Right now,
>>>> it is 1. Let me try that,
>>>
>>> The simplest way to cut on tests, which is actually similar to this patch,
>>> would be:
>>>
>>> - do not try all combinations of PTE access bits when reserved bits are set
>>>
>>> - do not try combinations with more than one reserved bit set
>>
>> Did you mean this? Just doing this reduces the combination by huge number.
>> I don't need to add your first PTE access combinations.
>>
>> diff --git a/x86/access.c b/x86/access.c
>> index 47807cc..a730b6b 100644
>> --- a/x86/access.c
>> +++ b/x86/access.c
>> @@ -317,9 +317,7 @@ static _Bool ac_test_legal(ac_test_t *at)
>>      /*
>>       * Shorten the test by avoiding testing too many reserved bit
>> combinations
>>       */
>> -    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
>> -        return false;
>> -    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>> +    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) +
>> F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
>>          return false;
>>
>>      return true;
> 
> Looks good to me, is it sufficient to keep the overall runtime sane?.  And maybe

It keeps the running time about 2 minutes.

> update the comment too, e.g. something like
> 
> 	/*
> 	 * Skip testing multiple reserved bits to shorten the test.  Reserved
> 	 * bit page faults are terminal and multiple reserved bits do not affect
> 	 * the error code; the odds of a KVM bug are super low, and the odds of
> 	 * actually being able to detect a bug are even lower.
> 	 */
> 

Sure. Will update the commit log.
Thanks
Babu
diff mbox series

Patch

diff --git a/x86/access.c b/x86/access.c
index 47807cc..e371dd5 100644
--- a/x86/access.c
+++ b/x86/access.c
@@ -317,9 +317,9 @@  static _Bool ac_test_legal(ac_test_t *at)
     /*
      * Shorten the test by avoiding testing too many reserved bit combinations
      */
-    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13)) > 1)
+    if ((F(AC_PDE_BIT51) + F(AC_PDE_BIT36) + F(AC_PDE_BIT13) + F(AC_CPU_CR4_PKE)) > 1)
         return false;
-    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36)) > 1)
+    if ((F(AC_PTE_BIT51) + F(AC_PTE_BIT36) + F(AC_CPU_CR4_PKE)) > 1)
         return false;
 
     return true;