diff mbox

[v2,13/13] fuzz/x86_emulate: Add an option to limit the number of instructions executed

Message ID 20170925142648.25959-13-george.dunlap@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

George Dunlap Sept. 25, 2017, 2:26 p.m. UTC
AFL considers a testcase to be a useful addition not only if there are
tuples exercised by that testcase which were not exercised otherwise,
but also if the *number* of times an individual tuple is exercised
changes significantly; in particular, if the number of the highes bit
changes (i.e., if it is run 1, 2-3, 4-7, 8-15, &c).

Unfortunately, one simple way to increase these stats it to execute
the same (or similar) instructions multiple times.  Such long
testcases take exponentially longer to fuzz: the fuzzer spends more
time flipping bits looking for meaningful changes, and each execution
takes longer because it is doing more things.  So long paths which add
nothing to the actual code coverage but effectively "distract" the
fuzzer, making it less effective.

Experiments have shown that not allowing infinite number of
instruction retries for the old (non-compact) format does indeed speed
up and increase code coverage.  However, it has also shown that on the
new, more compact format, having no instruction limit causes the highest
throughput in code coverage.

So leave the option in, but have it default to 0 (no limit).

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
---
 tools/fuzz/x86_instruction_emulator/afl-harness.c | 9 ++++++++-
 tools/fuzz/x86_instruction_emulator/fuzz-emul.c   | 7 ++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

Comments

Jan Beulich Oct. 4, 2017, 8:28 a.m. UTC | #1
>>> On 25.09.17 at 16:26, <george.dunlap@citrix.com> wrote:
> AFL considers a testcase to be a useful addition not only if there are
> tuples exercised by that testcase which were not exercised otherwise,
> but also if the *number* of times an individual tuple is exercised
> changes significantly; in particular, if the number of the highes bit
> changes (i.e., if it is run 1, 2-3, 4-7, 8-15, &c).

Perhaps I simply don't know about AFL (yet) to understand how "highest
bit" matters here, or even whose highest bits there's talk of.

> Unfortunately, one simple way to increase these stats it to execute
> the same (or similar) instructions multiple times.

But the change here doesn't look at instruction similarity at all.

> --- a/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
> +++ b/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
> @@ -960,10 +960,13 @@ void setup_fuzz_state(struct fuzz_state *state, const uint8_t *data_p, size_t si
>      state->data_num = size;
>  }
>  
> +int opt_instruction_limit = 0;

unsigned int (and formally no need for an initializer)

>  int runtest(struct fuzz_state *state) {
>      int rc;
>  
>      struct x86_emulate_ctxt *ctxt = &state->ctxt;
> +    int icount = 0;

unsigned int

> @@ -988,7 +991,9 @@ int runtest(struct fuzz_state *state) {
>  
>          rc = x86_emulate(ctxt, &state->ops);
>          printf("Emulation result: %d\n", rc);
> -    } while ( rc == X86EMUL_OKAY );
> +    } while ( rc == X86EMUL_OKAY &&
> +              (!opt_instruction_limit ||
> +               (++icount < opt_instruction_limit)) );

Hmm, if the initalizer of opt_instruction_limit was UINT_MAX, I think
this wouldn't severely impact results (running 4 billion emulations is
simply going to take too long) and this expression could be a simple
comparison.

Jan
George Dunlap Oct. 6, 2017, 10:40 a.m. UTC | #2
On 10/04/2017 09:28 AM, Jan Beulich wrote:
>>>> On 25.09.17 at 16:26, <george.dunlap@citrix.com> wrote:
>> AFL considers a testcase to be a useful addition not only if there are
>> tuples exercised by that testcase which were not exercised otherwise,
>> but also if the *number* of times an individual tuple is exercised
>> changes significantly; in particular, if the number of the highes bit
>> changes (i.e., if it is run 1, 2-3, 4-7, 8-15, &c).
> 
> Perhaps I simply don't know about AFL (yet) to understand how "highest
> bit" matters here, or even whose highest bits there's talk of.

Probably the easiest way to get this would be to read the
'technical_details.txt' [1] document about AFL, specifically the section
"Detecting new behaviors".  The section isn't long, and I'm not sure I
could explain the situation more concisely than the author has there.

[1] http://lcamtuf.coredump.cx/afl/technical_details.txt


>> Unfortunately, one simple way to increase these stats it to execute
>> the same (or similar) instructions multiple times.
> 
> But the change here doesn't look at instruction similarity at all.

I'm talking about how blind changes AFL makes to the input affect what
AFL sees at the "output".

Suppose it has a testcase where instruction A is executed once, and it
sees tuple N executed twice.  Now suppose it morphs the instruction so
instruction A is executed twice.  It will now see tuple N executed 4
times.  This is seen as 'new behavior', and so it will add that as a
'new' test case to its set of interesting things to fuzz.  Then suppose
it morphs one of those so that instruction A is executed four times.
Tuple N will be executed  8 times, which again is new behavior.  The
highest tuple count it sees as unique is 128; so in our example, it will
generate sample inputs up to 64 instructions -- even if the actual path
through the code for each instruction is identical to the
single-instruction one.

A 64-instruction test case will take at least 64x as long to execute as
a 1-instruction test case; and it will generally also take 64x as long
to fuzz.  This makes AFL is spending nearly 1000x as much time fuzzing
that test case as the 1-instruction test case, but for no very good
reason -- if you can't get actual new behavior we care about out of 2-3
instructions, you're not going to get it out of 60 instructions.

IOW, arbitrary numbers of instructions fool AFL into thinking it's found
something new and interesting when it hasn't.  Limiting the number of
instructions should in theory keep AFL from getting distracted with test
cases it thinks are new and unique but aren't.  And we see that for the
old format, this is true.

I suspect there's some number of instructions past which we get
diminishing returns even for the 'compact' format, but since testing
involves running things for 24 hours, there's also a diminishing returns
for that kind of optimization. :-)

>> --- a/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
>> +++ b/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
>> @@ -960,10 +960,13 @@ void setup_fuzz_state(struct fuzz_state *state, const uint8_t *data_p, size_t si
>>      state->data_num = size;
>>  }
>>  
>> +int opt_instruction_limit = 0;
> 
> unsigned int (and formally no need for an initializer)
> 
>>  int runtest(struct fuzz_state *state) {
>>      int rc;
>>  
>>      struct x86_emulate_ctxt *ctxt = &state->ctxt;
>> +    int icount = 0;
> 
> unsigned int

Ack

> 
>> @@ -988,7 +991,9 @@ int runtest(struct fuzz_state *state) {
>>  
>>          rc = x86_emulate(ctxt, &state->ops);
>>          printf("Emulation result: %d\n", rc);
>> -    } while ( rc == X86EMUL_OKAY );
>> +    } while ( rc == X86EMUL_OKAY &&
>> +              (!opt_instruction_limit ||
>> +               (++icount < opt_instruction_limit)) );
> 
> Hmm, if the initalizer of opt_instruction_limit was UINT_MAX, I think
> this wouldn't severely impact results (running 4 billion emulations is
> simply going to take too long) and this expression could be a simple
> comparison.

Yes, we could do that -- we'd have to change the argument parsing code
to handle that case instead, but that's probably a better trade-off.

And I don't have to argue about how having an initializer is easier to
understand what's going on even if it's not strictly necessary. :-)

 -George
Jan Beulich Oct. 6, 2017, 12:12 p.m. UTC | #3
>>> On 06.10.17 at 12:40, <george.dunlap@citrix.com> wrote:
> On 10/04/2017 09:28 AM, Jan Beulich wrote:
>>>>> On 25.09.17 at 16:26, <george.dunlap@citrix.com> wrote:
>>> AFL considers a testcase to be a useful addition not only if there are
>>> tuples exercised by that testcase which were not exercised otherwise,
>>> but also if the *number* of times an individual tuple is exercised
>>> changes significantly; in particular, if the number of the highes bit
>>> changes (i.e., if it is run 1, 2-3, 4-7, 8-15, &c).
>> 
>> Perhaps I simply don't know about AFL (yet) to understand how "highest
>> bit" matters here, or even whose highest bits there's talk of.
> 
> Probably the easiest way to get this would be to read the
> 'technical_details.txt' [1] document about AFL, specifically the section
> "Detecting new behaviors".  The section isn't long, and I'm not sure I
> could explain the situation more concisely than the author has there.

Having read that, I still don't see what "bit" you talk about. The
text there talks about "hit"s - is this perhaps just a typo here?

>>> Unfortunately, one simple way to increase these stats it to execute
>>> the same (or similar) instructions multiple times.
>> 
>> But the change here doesn't look at instruction similarity at all.
> 
> I'm talking about how blind changes AFL makes to the input affect what
> AFL sees at the "output".
> 
> Suppose it has a testcase where instruction A is executed once, and it
> sees tuple N executed twice.  Now suppose it morphs the instruction so
> instruction A is executed twice.  It will now see tuple N executed 4
> times.  This is seen as 'new behavior', and so it will add that as a
> 'new' test case to its set of interesting things to fuzz.  Then suppose
> it morphs one of those so that instruction A is executed four times.
> Tuple N will be executed  8 times, which again is new behavior.  The
> highest tuple count it sees as unique is 128; so in our example, it will
> generate sample inputs up to 64 instructions -- even if the actual path
> through the code for each instruction is identical to the
> single-instruction one.
> 
> A 64-instruction test case will take at least 64x as long to execute as
> a 1-instruction test case; and it will generally also take 64x as long
> to fuzz.  This makes AFL is spending nearly 1000x as much time fuzzing
> that test case as the 1-instruction test case, but for no very good
> reason -- if you can't get actual new behavior we care about out of 2-3
> instructions, you're not going to get it out of 60 instructions.
> 
> IOW, arbitrary numbers of instructions fool AFL into thinking it's found
> something new and interesting when it hasn't.  Limiting the number of
> instructions should in theory keep AFL from getting distracted with test
> cases it thinks are new and unique but aren't.  And we see that for the
> old format, this is true.
> 
> I suspect there's some number of instructions past which we get
> diminishing returns even for the 'compact' format, but since testing
> involves running things for 24 hours, there's also a diminishing returns
> for that kind of optimization. :-)

All understood, yet I still don't understand why you say "the
same (or similar)" when really you only care to limit instruction
count. This is the more that the same insn executed with
different inputs can have dramatically different effects (most
severe case probably being no exception vs exception).

Jan
George Dunlap Oct. 6, 2017, 1:02 p.m. UTC | #4
On 10/06/2017 01:12 PM, Jan Beulich wrote:
>>>> On 06.10.17 at 12:40, <george.dunlap@citrix.com> wrote:
>> On 10/04/2017 09:28 AM, Jan Beulich wrote:
>>>>>> On 25.09.17 at 16:26, <george.dunlap@citrix.com> wrote:
>>>> AFL considers a testcase to be a useful addition not only if there are
>>>> tuples exercised by that testcase which were not exercised otherwise,
>>>> but also if the *number* of times an individual tuple is exercised
>>>> changes significantly; in particular, if the number of the highes bit
>>>> changes (i.e., if it is run 1, 2-3, 4-7, 8-15, &c).
>>>
>>> Perhaps I simply don't know about AFL (yet) to understand how "highest
>>> bit" matters here, or even whose highest bits there's talk of.
>>
>> Probably the easiest way to get this would be to read the
>> 'technical_details.txt' [1] document about AFL, specifically the section
>> "Detecting new behaviors".  The section isn't long, and I'm not sure I
>> could explain the situation more concisely than the author has there.
> 
> Having read that, I still don't see what "bit" you talk about. The
> text there talks about "hit"s - is this perhaps just a typo here?

I'm sorry my wording wasn't very precise, but I don't understand why the
examples in parentheses don't communicate what I mean.  Here by "highest
bit" I basically meant, the highest bit which is non-zero.  For 2 and 3,
the number of the highest non-zero bit is 2.  For 4, 5, 6, and 7, the
number of the highest non-zero bit is 3.  For 8-15, the number of the
highest non-zero bit is 4, and so on.

If two test cases touch the exact same branch tuples, but the order of
at least one the counts is different (i.e., if the number of the highest
non-zero bit is different), then AFL considers them as behaving differently.

>>>> Unfortunately, one simple way to increase these stats it to execute
>>>> the same (or similar) instructions multiple times.
>>>
>>> But the change here doesn't look at instruction similarity at all.
>>
>> I'm talking about how blind changes AFL makes to the input affect what
>> AFL sees at the "output".
>>
>> Suppose it has a testcase where instruction A is executed once, and it
>> sees tuple N executed twice.  Now suppose it morphs the instruction so
>> instruction A is executed twice.  It will now see tuple N executed 4
>> times.  This is seen as 'new behavior', and so it will add that as a
>> 'new' test case to its set of interesting things to fuzz.  Then suppose
>> it morphs one of those so that instruction A is executed four times.
>> Tuple N will be executed  8 times, which again is new behavior.  The
>> highest tuple count it sees as unique is 128; so in our example, it will
>> generate sample inputs up to 64 instructions -- even if the actual path
>> through the code for each instruction is identical to the
>> single-instruction one.
>>
>> A 64-instruction test case will take at least 64x as long to execute as
>> a 1-instruction test case; and it will generally also take 64x as long
>> to fuzz.  This makes AFL is spending nearly 1000x as much time fuzzing
>> that test case as the 1-instruction test case, but for no very good
>> reason -- if you can't get actual new behavior we care about out of 2-3
>> instructions, you're not going to get it out of 60 instructions.
>>
>> IOW, arbitrary numbers of instructions fool AFL into thinking it's found
>> something new and interesting when it hasn't.  Limiting the number of
>> instructions should in theory keep AFL from getting distracted with test
>> cases it thinks are new and unique but aren't.  And we see that for the
>> old format, this is true.
>>
>> I suspect there's some number of instructions past which we get
>> diminishing returns even for the 'compact' format, but since testing
>> involves running things for 24 hours, there's also a diminishing returns
>> for that kind of optimization. :-)
> 
> All understood, yet I still don't understand why you say "the
> same (or similar)" when really you only care to limit instruction
> count. This is the more that the same insn executed with
> different inputs can have dramatically different effects (most
> severe case probably being no exception vs exception).

I don't care to limit the instruction count.  I care to restrict AFL
from chasing false leads, thinking that it's discovered "unique" and
"interesting" behavior because it's managed to run an instruction 50
times instead of 5.

One way AFL can chase false leads is by executing *the exact same*
instruction a number of times in a row.  But many instructions which are
*similar* also trigger similar paths; so the same effect could happen
just as well with to instructions that touch nearly the same paths as
with a single instruction.

But I'm not sure what exceptions vs no exceptions has to do with it, so
perhaps I still don't understand you. :-)

 -George
diff mbox

Patch

diff --git a/tools/fuzz/x86_instruction_emulator/afl-harness.c b/tools/fuzz/x86_instruction_emulator/afl-harness.c
index 6b0f66f923..db6bb2891f 100644
--- a/tools/fuzz/x86_instruction_emulator/afl-harness.c
+++ b/tools/fuzz/x86_instruction_emulator/afl-harness.c
@@ -15,6 +15,7 @@  static uint8_t input[INPUT_SIZE];
 
 extern bool opt_compact;
 extern bool opt_rerun;
+extern int opt_instruction_limit;
 
 int main(int argc, char **argv)
 {
@@ -34,11 +35,13 @@  int main(int argc, char **argv)
             OPT_MIN_SIZE,
             OPT_COMPACT,
             OPT_RERUN,
+            OPT_INSTRUCTION_LIMIT,
         };
         static const struct option lopts[] = {
             { "min-input-size", no_argument, NULL, OPT_MIN_SIZE },
             { "compact", required_argument, NULL, OPT_COMPACT },
             { "rerun", no_argument, NULL, OPT_RERUN },
+            { "instruction-limit", required_argument, NULL, OPT_INSTRUCTION_LIMIT },
             { 0, 0, 0, 0 }
         };
         int c = getopt_long_only(argc, argv, "", lopts, NULL);
@@ -61,8 +64,12 @@  int main(int argc, char **argv)
             opt_rerun = true;
             break;
 
+        case OPT_INSTRUCTION_LIMIT:
+            opt_instruction_limit = atoi(optarg);
+            break;
+
         case '?':
-            printf("Usage: %s [--compact=0|1] [--rerun] $FILE [$FILE...] | [--min-input-size]\n", argv[0]);
+            printf("Usage: %s [--compact=0|1] [--rerun] [--instruction-limit=N] $FILE [$FILE...] | [--min-input-size]\n", argv[0]);
             exit(-1);
             break;
 
diff --git a/tools/fuzz/x86_instruction_emulator/fuzz-emul.c b/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
index 48cad0307a..c2ab029347 100644
--- a/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
+++ b/tools/fuzz/x86_instruction_emulator/fuzz-emul.c
@@ -960,10 +960,13 @@  void setup_fuzz_state(struct fuzz_state *state, const uint8_t *data_p, size_t si
     state->data_num = size;
 }
 
+int opt_instruction_limit = 0;
+
 int runtest(struct fuzz_state *state) {
     int rc;
 
     struct x86_emulate_ctxt *ctxt = &state->ctxt;
+    int icount = 0;
     
     state->ops = all_fuzzer_ops;
 
@@ -988,7 +991,9 @@  int runtest(struct fuzz_state *state) {
 
         rc = x86_emulate(ctxt, &state->ops);
         printf("Emulation result: %d\n", rc);
-    } while ( rc == X86EMUL_OKAY );
+    } while ( rc == X86EMUL_OKAY &&
+              (!opt_instruction_limit ||
+               (++icount < opt_instruction_limit)) );
 
     save_fpu_state(state->fxsave);