diff mbox series

[v6,1/3] KVM: selftests: implement random number generation for guest code

Message ID 20220912195849.3989707-2-coltonlewis@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: selftests: randomize memory access of dirty_log_perf_test | expand

Commit Message

Colton Lewis Sept. 12, 2022, 7:58 p.m. UTC
Implement random number generation for guest code to randomize parts
of the test, making it less predictable and a more accurate reflection
of reality.

Create a -r argument to specify a random seed. If no argument is
provided, the seed defaults to 0. The random seed is set with
perf_test_set_random_seed() and must be set before guest_code runs to
apply.

The random number generator chosen is the Park-Miller Linear
Congruential Generator, a fancy name for a basic and well-understood
random number generator entirely sufficient for this purpose. Each
vCPU calculates its own seed by adding its index to the seed provided.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c    | 12 ++++++++++--
 tools/testing/selftests/kvm/include/perf_test_util.h |  2 ++
 tools/testing/selftests/kvm/include/test_util.h      |  2 ++
 tools/testing/selftests/kvm/lib/perf_test_util.c     |  8 ++++++++
 tools/testing/selftests/kvm/lib/test_util.c          |  9 +++++++++
 5 files changed, 31 insertions(+), 2 deletions(-)

Comments

Sean Christopherson Oct. 7, 2022, 8:41 p.m. UTC | #1
On Mon, Sep 12, 2022, Colton Lewis wrote:
> diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
> index 99e0dcdc923f..2dd286bcf46f 100644
> --- a/tools/testing/selftests/kvm/include/test_util.h
> +++ b/tools/testing/selftests/kvm/include/test_util.h
> @@ -143,4 +143,6 @@ static inline void *align_ptr_up(void *x, size_t size)
>  	return (void *)align_up((unsigned long)x, size);
>  }
>  
> +void guest_random(uint32_t *seed);

This is a weird and unintuitive API.  The in/out param exposes the gory details
of the pseudo-RNG to the caller, and makes it cumbersome to use, e.g. to create
a 64-bit number or to consume the result in a conditional.

It's also not inherently guest-specific, or even KVM specific.  We should consider
landing this in common selftests code so that others can use it and even expand on
it.  E.g. in a previous life, I worked with a tool that implemented all sorts of
random number magic that provided APIs to get random bools with 1->99 probabilty,
random numbers along Guassian curves, bounded numbers, etc.

We definitely don't need that level of fanciness right way, but it doesn't require
much effort to define the initial APIs such that they won't require modification
if/when fancier usage comes along.

E.g. to start:

---
.c:

/*
 * Random number generator that is usable in any context.  Not thread-safe.
 * This is the Park-Miller LCG using standard constants.
 */
struct ksft_pseudo_rng {
	uint32_t state;
}

void ksft_pseudo_rng_init(struct ksft_pseudo_rng *rng, uint64_t seed)
{
	rng->state = seed;	
}

static uint32_t ksft_pseudo_rng_random(struct ksft_pseudo_rng *rng)
{

	*rng->state = (uint64_t)*rng->state * 48271 % ((1u << 31) - 1);
	return *rng->state;
}

uint32_t random_u32(struct ksft_pseudo_rng *rng)
{
	return ksft_pseudo_rng_random(rng);
}

uint64_t random_u64(struct ksft_pseudo_rng *rng)
{
	return (uint64_t)random_u32(rng) << 32 | random_u32(rng);
}

.h

struct ksft_pseudo_rng;

void ksft_pseudo_rng_init(struct ksft_pseudo_rng *rng, uint64_t seed);
uint32_t random_u32(struct ksft_pseudo_rng *rng);
uint64_t random_u64(struct ksft_pseudo_rng *rng);
---

And then if/when want to add fancier stuff, say bounded 32-bit values, we can do
so without having to update existing users.
Sean Christopherson Oct. 10, 2022, 6:03 p.m. UTC | #2
On Mon, Sep 12, 2022, Colton Lewis wrote:
> Implement random number generation for guest code to randomize parts
> of the test, making it less predictable and a more accurate reflection
> of reality.
> 
> Create a -r argument to specify a random seed. If no argument is
> provided, the seed defaults to 0. The random seed is set with
> perf_test_set_random_seed() and must be set before guest_code runs to
> apply.
> 
> The random number generator chosen is the Park-Miller Linear
> Congruential Generator, a fancy name for a basic and well-understood
> random number generator entirely sufficient for this purpose. Each
> vCPU calculates its own seed by adding its index to the seed provided.

Why not grab the kernel's pseudo-RNG from prandom_seed_state() + prandom_u32_state()?
Sean Christopherson Oct. 11, 2022, 6:26 p.m. UTC | #3
On Tue, Oct 11, 2022, Colton Lewis wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Sep 12, 2022, Colton Lewis wrote:
> > > Implement random number generation for guest code to randomize parts
> > > of the test, making it less predictable and a more accurate reflection
> > > of reality.
> 
> > > Create a -r argument to specify a random seed. If no argument is
> > > provided, the seed defaults to 0. The random seed is set with
> > > perf_test_set_random_seed() and must be set before guest_code runs to
> > > apply.
> 
> > > The random number generator chosen is the Park-Miller Linear
> > > Congruential Generator, a fancy name for a basic and well-understood
> > > random number generator entirely sufficient for this purpose. Each
> > > vCPU calculates its own seed by adding its index to the seed provided.
> 
> > Why not grab the kernel's pseudo-RNG from prandom_seed_state() +
> > prandom_u32_state()?
> 
> The guest is effectively a minimal kernel running in a VM that doesn't
> have access to this, correct?

Oh, I didn't mean link to the kernel code, I meant "why not copy+paste the kernel
code?".  In other words, why select a different RNG implementation than what the
kernel uses?  In general, selftests and tools try to follow the kernel code, even
when copy+pasting, as that avoids questions like "why does the kernel do X but
selftests do Y?".  The copy+paste does sometimes lead to maintenance pain, e.g. if
the copied code has a bug that the kernel then fixes, but that seems unlikely to
happen in this case.
Sean Christopherson Oct. 11, 2022, 6:39 p.m. UTC | #4
On Tue, Oct 11, 2022, Colton Lewis wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Sep 12, 2022, Colton Lewis wrote:
> > > diff --git a/tools/testing/selftests/kvm/include/test_util.h
> > > b/tools/testing/selftests/kvm/include/test_util.h
> > > index 99e0dcdc923f..2dd286bcf46f 100644
> > > --- a/tools/testing/selftests/kvm/include/test_util.h
> > > +++ b/tools/testing/selftests/kvm/include/test_util.h
> > > @@ -143,4 +143,6 @@ static inline void *align_ptr_up(void *x, size_t
> > > size)
> > >   	return (void *)align_up((unsigned long)x, size);
> > >   }
> 
> > > +void guest_random(uint32_t *seed);
> 
> > This is a weird and unintuitive API.  The in/out param exposes the gory
> > details of the pseudo-RNG to the caller, and makes it cumbersome to use,
> > e.g. to create a 64-bit number or to consume the result in a conditional.
> 
> To explain my reasoning:
> 
> It's simple because there is exactly one way to use it and it's short so
> anyone can understand how the function works at a glance. It's similar
> to the API used by other thread-safe RNGs like rand_r and random_r. They
> also use in/out parameters. That's the only way to buy thread
> safety. Callers would also have to manage their own state in your
> example with an in/out parameter if they want thread safety.
> 
> I disagree the details are gory. You put in a number and get a new
> number.

Regardless of whether or not the details are gory, having to be aware of those
details unnecessarily impedes understanding the code.  The vast, vast majority of
folks that read this code won't care about how PRNGs work.  Even if the reader is
familiar with PRNGs, those details aren't at all relevant to understanding what
the guest code does.  The reader only needs to know "oh, this is randomizing the
address".  How the randomization works is completely irrelevant for that level of
understanding.

> It's common knowledge PRNGs work this way.

For people that are familiar with PRNGs, yes, but there will undoubtedly be people
that read this code and have practically zero experience with PRNGs.

> I understand you are thinking about ease of future extensions, but this
> strikes me as premature abstraction. Additional APIs can always be added
> later for the fancy stuff without modifying the callers that don't need it.
> 
> I agree returning the value could make it easier to use as part of
> expressions, but is it that important?

Yes, because in pretty much every use case, the random number is going to be
immediately consumed.  Readability and robustness aside, returning the value cuts
the amount of code need to generate and consume a random number in half.

> > It's also not inherently guest-specific, or even KVM specific.  We
> > should consider
> > landing this in common selftests code so that others can use it and even
> > expand on
> > it.  E.g. in a previous life, I worked with a tool that implemented all
> > sorts of
> > random number magic that provided APIs to get random bools with 1->99
> > probabilty,
> > random numbers along Guassian curves, bounded numbers, etc.
> 
> 
> People who need random numbers outside the guest should use stdlib.h. No
> need to reimplement a full random library. The point of this
> implementation was to do the simplest thing that could provide random
> numbers to the guest.

Ah, right.
Sean Christopherson Oct. 11, 2022, 11:47 p.m. UTC | #5
On Tue, Oct 11, 2022, Colton Lewis wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > Regardless of whether or not the details are gory, having to be aware of
> > those details unnecessarily impedes understanding the code.  The vast, vast
> > majority of folks that read this code won't care about how PRNGs work.
> > Even if the reader is familiar with PRNGs, those details aren't at all
> > relevant to understanding what the guest code does.  The reader only needs
> > to know "oh, this is randomizing the address".  How the randomization works
> > is completely irrelevant for that level of understanding.
> 
> It is relevant if the reader of the guest code cares about reentrancy
> and thread-safety (good for such things as reproducing the same stream
> of randoms from the same initial conditions), because they will have to
> manage some state to make that work. Whether that state is an integer or
> an opaque struct requires the same level of knowledge to use the API.

By "readers" I'm not (only) talking about developers actively writing code, I'm
I'm talking about people that run this test, encounter a failure, and read the code
to try and understand what went wrong.  For those people, knowing that the guest is
generating a random number is sufficient information during initial triage.  If the
failure ends up being related to the random number, then yes they'll likely need to
learn the details, but that's a few steps down the road.

> > > I agree returning the value could make it easier to use as part of
> > > expressions, but is it that important?
> 
> > Yes, because in pretty much every use case, the random number is going to
> > be immediately consumed.  Readability and robustness aside, returning the
> > value cuts the amount of code need to generate and consume a random number
> > in half.
> 
> Ok. This is trivial to change (or implement on top of what is
> there). Would you be happy with something like the following?
> 
> uint32_t guest_random(uint32_t *seed)
> {
> 	*seed = (uint64_t)*seed * 48271 % ((uint32_t)(1 << 31) - 1);
> 	return *seed;
> }

Honestly, no.  I truly don't understand the pushback on throwing the seed into
a struct and giving the helpers more precise names.  E.g. without looking at the
prototype, guest_random() tells the reader nothing about the size of the random
number returned, whereas <namespace>_random_u32() or <namespace>_get_random_u32()
(if we want to more closely follow kernel nomenclature) tells the reader exactly
what is returned.

As a contrived example, imagine a bug where the intent was to do 50/50 reads/writes,
but the test ends up doing almost all writes.  Code that looks like:

	if (guest_random(...))

is less obviously wrong than

	if (guest_random_u32(...))

even if the user knows nothing about how the random number is generated.

And aside from hiding details and providing abstraction for readers, throwing a
shim struct around the seed means that if we do want to change up the internal
implementation, then we can do so without having to churn users.

The code is trivial to write and I can't think of any meaningful downside.  Worst
case scenario, we end up with an implementation that is slightly more formal than
then we really need.
Sean Christopherson Oct. 12, 2022, 11:34 p.m. UTC | #6
On Wed, Oct 12, 2022, Colton Lewis wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > The code is trivial to write and I can't think of any meaningful downside.
> > Worst case scenario, we end up with an implementation that is slightly more
> > formal than then we really need.
> 
> As a matter of personal taste, I don't like the additional formality
> making things look more complicated than they are. The stakes are small
> here but that kind of extra boilerplate can add up to make things
> confusing.

I agree about unnecessary boilerplate being a burden, especially when it comes to
KVM selftests, which are ridiculously "formal" and make simple operations
frustratingly difficult.

In this case though, I think the benefits of encapsulating the seed outweigh the
cost of the formality by a good margin, and I don't see that formality snowballing
any further.  A struct gets us:

  - Type checking on the input param, e.g. prevents passing in garbage for the seed.
  - The ability to switch out the algorithm.
  - Some protection against overwriting the seed, e.g. corrupting the struct pointer
    will explode and a memcpy() to overwrite the struct will be more visibily wrong.

> Thanks for your patience. I never wanted to cause trouble.

Heh, no worries.
diff mbox series

Patch

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index d60a34cdfaee..a89a620f50d4 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -126,6 +126,7 @@  struct test_params {
 	bool partition_vcpu_memory_access;
 	enum vm_mem_backing_src_type backing_src;
 	int slots;
+	uint32_t random_seed;
 };
 
 static void toggle_dirty_logging(struct kvm_vm *vm, int slots, bool enable)
@@ -220,6 +221,9 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 				 p->slots, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
+	/* If no argument provided, random seed will be 0. */
+	pr_info("Random seed: %u\n", p->random_seed);
+	perf_test_set_random_seed(vm, p->random_seed);
 	perf_test_set_wr_fract(vm, p->wr_fract);
 
 	guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
@@ -337,7 +341,7 @@  static void help(char *name)
 {
 	puts("");
 	printf("usage: %s [-h] [-i iterations] [-p offset] [-g] "
-	       "[-m mode] [-n] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]"
+	       "[-m mode] [-n] [-b vcpu bytes] [-v vcpus] [-o] [-r random seed ] [-s mem type]"
 	       "[-x memslots]\n", name);
 	puts("");
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
@@ -362,6 +366,7 @@  static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -r: specify the starting random seed.\n");
 	backing_src_help("-s");
 	printf(" -x: Split the memory region into this number of memslots.\n"
 	       "     (default: 1)\n");
@@ -388,7 +393,7 @@  int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ghi:p:m:nb:f:v:os:x:")) != -1) {
+	while ((opt = getopt(argc, argv, "ghi:p:m:nb:f:v:or:s:x:")) != -1) {
 		switch (opt) {
 		case 'g':
 			dirty_log_manual_caps = 0;
@@ -421,6 +426,9 @@  int main(int argc, char *argv[])
 		case 'o':
 			p.partition_vcpu_memory_access = false;
 			break;
+		case 'r':
+			p.random_seed = atoi(optarg);
+			break;
 		case 's':
 			p.backing_src = parse_backing_src_type(optarg);
 			break;
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index d822cb670f1c..f18530984b42 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -34,6 +34,7 @@  struct perf_test_args {
 	uint64_t gpa;
 	uint64_t size;
 	uint64_t guest_page_size;
+	uint32_t random_seed;
 	int wr_fract;
 
 	/* Run vCPUs in L2 instead of L1, if the architecture supports it. */
@@ -51,6 +52,7 @@  struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 void perf_test_destroy_vm(struct kvm_vm *vm);
 
 void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
+void perf_test_set_random_seed(struct kvm_vm *vm, uint32_t random_seed);
 
 void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
 void perf_test_join_vcpu_threads(int vcpus);
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 99e0dcdc923f..2dd286bcf46f 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -143,4 +143,6 @@  static inline void *align_ptr_up(void *x, size_t size)
 	return (void *)align_up((unsigned long)x, size);
 }
 
+void guest_random(uint32_t *seed);
+
 #endif /* SELFTEST_KVM_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index f989ff91f022..b1e731de0966 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -47,6 +47,7 @@  void perf_test_guest_code(uint32_t vcpu_id)
 	uint64_t gva;
 	uint64_t pages;
 	int i;
+	uint32_t rand = pta->random_seed + vcpu_id;
 
 	/* Make sure vCPU args data structure is not corrupt. */
 	GUEST_ASSERT(vcpu_args->vcpu_id == vcpu_id);
@@ -57,6 +58,7 @@  void perf_test_guest_code(uint32_t vcpu_id)
 	while (true) {
 		for (i = 0; i < pages; i++) {
 			uint64_t addr = gva + (i * pta->guest_page_size);
+			guest_random(&rand);
 
 			if (i % pta->wr_fract == 0)
 				*(uint64_t *)addr = 0x0123456789ABCDEF;
@@ -224,6 +226,12 @@  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
 	sync_global_to_guest(vm, perf_test_args);
 }
 
+void perf_test_set_random_seed(struct kvm_vm *vm, uint32_t random_seed)
+{
+	perf_test_args.random_seed = random_seed;
+	sync_global_to_guest(vm, perf_test_args.random_seed);
+}
+
 uint64_t __weak perf_test_nested_pages(int nr_vcpus)
 {
 	return 0;
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 6d23878bbfe1..28c895743abe 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -17,6 +17,15 @@ 
 
 #include "test_util.h"
 
+/*
+ * Random number generator that is usable from guest code. This is the
+ * Park-Miller LCG using standard constants.
+ */
+void guest_random(uint32_t *seed)
+{
+	*seed = (uint64_t)*seed * 48271 % ((uint32_t)(1 << 31) - 1);
+}
+
 /*
  * Parses "[0-9]+[kmgt]?".
  */