diff mbox series

selftests/rseq: take large C-state exit latency into consideration

Message ID 20240322163351.150673-1-zide.chen@intel.com (mailing list archive)
State New
Headers show
Series selftests/rseq: take large C-state exit latency into consideration | expand

Commit Message

Chen, Zide March 22, 2024, 4:33 p.m. UTC
Currently, the migration worker delays 1-10 us, assuming that one
KVM_RUN iteration only takes a few microseconds.  But if C-state exit
latencies are large enough, for example, hundreds or even thousands
of microseconds on server CPUs, it may happen that it's not able to
bring the target CPU out of C-state before the migration worker starts
to migrate it to the next CPU.

If the system workload is light, most CPUs could be at a certain level
of C-state, and the vCPU thread may waste milliseconds before it can
actually migrate to a new CPU.

Thus, the tests may be inefficient in such systems, and in some cases
it may fail the migration/KVM_RUN ratio sanity check.

Since we are not able to turn off the cpuidle sub-system in run time,
this patch creates an idle thread on every CPU to prevent them from
entering C-states.

Additionally, seems it's reasonable to randomize the length of usleep(),
other than delay in a fixed pattern.

Signed-off-by: Zide Chen <zide.chen@intel.com>
---
 tools/testing/selftests/kvm/rseq_test.c | 76 ++++++++++++++++++++++---
 1 file changed, 69 insertions(+), 7 deletions(-)

Comments

Sean Christopherson April 5, 2024, 11:01 p.m. UTC | #1
On Fri, Mar 22, 2024, Zide Chen wrote:
> Currently, the migration worker delays 1-10 us, assuming that one
> KVM_RUN iteration only takes a few microseconds.  But if C-state exit
> latencies are large enough, for example, hundreds or even thousands
> of microseconds on server CPUs, it may happen that it's not able to
> bring the target CPU out of C-state before the migration worker starts
> to migrate it to the next CPU.
> 
> If the system workload is light, most CPUs could be at a certain level
> of C-state, and the vCPU thread may waste milliseconds before it can
> actually migrate to a new CPU.

Well fudge.  That's definitely not on my bingo sheet.

> Thus, the tests may be inefficient in such systems, and in some cases
> it may fail the migration/KVM_RUN ratio sanity check.
> 
> Since we are not able to turn off the cpuidle sub-system in run time,
> this patch creates an idle thread on every CPU to prevent them from
> entering C-states.

First off, huge thanks for debugging this!  That must have been quite the task
(no pun intended).

While spinning up threads on every CPU is a clever way to ensure they don't go
into a deep sleep state, I'm not exactly excited about the idea of putting every
reachable CPU into a busy loop.  And while this doesn't add _that_ much complexity,
I'm not sure the benefit (preserving the assert for all systems) is worth it.  I
also don't want to arbitrarily prevent idle task (as in, the kernel's idle task)
interactions.  E.g. it's highly (highly) unlikely, but not impossible for there
to be a bug that's unique to idle tasks, or C-states, or other edge case.

Are there any metrics/stats that can be (easily) checked to grant an exception
to the sanity check?  That's a very hand-wavy question, as I'm not even sure what
type of stat we'd want to look at.  Actual runtime of a task, maybe?

If that's not easy, what if we add an off-by-default command line option to skip
the sanity check?  I was resistant to simply deleting the assert in the past, but
that was mainly because I didn't want to delete it without understanding what was
causing problems.  That would allow CI environments to opt-out as needed, while
still keeping the sanity check alive for enough systems to make it useful.
Chen, Zide April 12, 2024, 4:47 p.m. UTC | #2
On 4/5/2024 4:01 PM, Sean Christopherson wrote:
> On Fri, Mar 22, 2024, Zide Chen wrote:
>> Currently, the migration worker delays 1-10 us, assuming that one
>> KVM_RUN iteration only takes a few microseconds.  But if C-state exit
>> latencies are large enough, for example, hundreds or even thousands
>> of microseconds on server CPUs, it may happen that it's not able to
>> bring the target CPU out of C-state before the migration worker starts
>> to migrate it to the next CPU.
>>
>> If the system workload is light, most CPUs could be at a certain level
>> of C-state, and the vCPU thread may waste milliseconds before it can
>> actually migrate to a new CPU.
> 
> Well fudge.  That's definitely not on my bingo sheet.
> 
>> Thus, the tests may be inefficient in such systems, and in some cases
>> it may fail the migration/KVM_RUN ratio sanity check.
>>
>> Since we are not able to turn off the cpuidle sub-system in run time,
>> this patch creates an idle thread on every CPU to prevent them from
>> entering C-states.
> 
> First off, huge thanks for debugging this!  That must have been quite the task
> (no pun intended).
> 
> While spinning up threads on every CPU is a clever way to ensure they don't go
> into a deep sleep state, I'm not exactly excited about the idea of putting every
> reachable CPU into a busy loop.  And while this doesn't add _that_ much complexity,
> I'm not sure the benefit (preserving the assert for all systems) is worth it.  I
> also don't want to arbitrarily prevent idle task (as in, the kernel's idle task)
> interactions.  E.g. it's highly (highly) unlikely, but not impossible for there
> to be a bug that's unique to idle tasks, or C-states, or other edge case.
> 
> Are there any metrics/stats that can be (easily) checked to grant an exception
> to the sanity check?  That's a very hand-wavy question, as I'm not even sure what
> type of stat we'd want to look at.  Actual runtime of a task, maybe?
> 
> If that's not easy, what if we add an off-by-default command line option to skip
> the sanity check?  I was resistant to simply deleting the assert in the past, but
> that was mainly because I didn't want to delete it without understanding what was
> causing problems.  That would allow CI environments to opt-out as needed, while
> still keeping the sanity check alive for enough systems to make it useful.

Sorry for not replying earlier. I overlooked your email from my inbox. :)

Alternative to the busy loop, how about using the /dev/cpu_dma_latency
interface to disable c-states (I wish I had learned this before writing
the initial patch)? The good thing is it can do automatic cleanup when
it closes the fd.

The reason why I still think of disabling c-states is, even in the low
c-states exit latency setup, it can still increase the chances of
successful migration.

Otherwise, I can implement a command line option to skip the sanity
check, as I was not able to find out a metrics/stats that is easy and
reliable to indicate that the scheduler is not able to wake up the
target CPU before the task is scheduled to another CPU.
Sean Christopherson April 12, 2024, 6:52 p.m. UTC | #3
On Fri, Apr 12, 2024, Zide Chen wrote:
> On 4/5/2024 4:01 PM, Sean Christopherson wrote:
> > On Fri, Mar 22, 2024, Zide Chen wrote:
> >> Currently, the migration worker delays 1-10 us, assuming that one
> >> KVM_RUN iteration only takes a few microseconds.  But if C-state exit
> >> latencies are large enough, for example, hundreds or even thousands
> >> of microseconds on server CPUs, it may happen that it's not able to
> >> bring the target CPU out of C-state before the migration worker starts
> >> to migrate it to the next CPU.
> >>
> >> If the system workload is light, most CPUs could be at a certain level
> >> of C-state, and the vCPU thread may waste milliseconds before it can
> >> actually migrate to a new CPU.
> > 
> > Well fudge.  That's definitely not on my bingo sheet.
> > 
> >> Thus, the tests may be inefficient in such systems, and in some cases
> >> it may fail the migration/KVM_RUN ratio sanity check.
> >>
> >> Since we are not able to turn off the cpuidle sub-system in run time,
> >> this patch creates an idle thread on every CPU to prevent them from
> >> entering C-states.
> > 
> > First off, huge thanks for debugging this!  That must have been quite the task
> > (no pun intended).
> > 
> > While spinning up threads on every CPU is a clever way to ensure they don't go
> > into a deep sleep state, I'm not exactly excited about the idea of putting every
> > reachable CPU into a busy loop.  And while this doesn't add _that_ much complexity,
> > I'm not sure the benefit (preserving the assert for all systems) is worth it.  I
> > also don't want to arbitrarily prevent idle task (as in, the kernel's idle task)
> > interactions.  E.g. it's highly (highly) unlikely, but not impossible for there
> > to be a bug that's unique to idle tasks, or C-states, or other edge case.
> > 
> > Are there any metrics/stats that can be (easily) checked to grant an exception
> > to the sanity check?  That's a very hand-wavy question, as I'm not even sure what
> > type of stat we'd want to look at.  Actual runtime of a task, maybe?
> > 
> > If that's not easy, what if we add an off-by-default command line option to skip
> > the sanity check?  I was resistant to simply deleting the assert in the past, but
> > that was mainly because I didn't want to delete it without understanding what was
> > causing problems.  That would allow CI environments to opt-out as needed, while
> > still keeping the sanity check alive for enough systems to make it useful.
> 
> Sorry for not replying earlier. I overlooked your email from my inbox. :)
> 
> Alternative to the busy loop, how about using the /dev/cpu_dma_latency
> interface to disable c-states (I wish I had learned this before writing
> the initial patch)? The good thing is it can do automatic cleanup when
> it closes the fd.

It's probably not practical to touch /dev/cpu_dma_latency in code, e.g. on my
system it's fully root-only.  And forcing rseq_test to run as root, or be bookended
with script commands to toggle /dev/cpu_dma_latency, is not a net positive.
Lastly, fiddling with a system-wide knob in a KVM selftests is opening a can of
worms I don't want to open.

However, we could have the failing TEST_ASSERT() explicitly call out
/dev/cpu_dma_latency as a knob to try changing if the assert is failing.  If we
do that *and* add a command line option to skip the sanity check, that seems like
it would give users sufficient flexibility to avoid false positives, while still
maintaining good coverage.

> The reason why I still think of disabling c-states is, even in the low
> c-states exit latency setup, it can still increase the chances of
> successful migration.
> 
> Otherwise, I can implement a command line option to skip the sanity
> check, as I was not able to find out a metrics/stats that is easy and
> reliable to indicate that the scheduler is not able to wake up the
> target CPU before the task is scheduled to another CPU.
Chen, Zide April 12, 2024, 10:16 p.m. UTC | #4
On 4/12/2024 11:52 AM, Sean Christopherson wrote:
> On Fri, Apr 12, 2024, Zide Chen wrote:
>> On 4/5/2024 4:01 PM, Sean Christopherson wrote:
>>> On Fri, Mar 22, 2024, Zide Chen wrote:
>>>> Currently, the migration worker delays 1-10 us, assuming that one
>>>> KVM_RUN iteration only takes a few microseconds.  But if C-state exit
>>>> latencies are large enough, for example, hundreds or even thousands
>>>> of microseconds on server CPUs, it may happen that it's not able to
>>>> bring the target CPU out of C-state before the migration worker starts
>>>> to migrate it to the next CPU.
>>>>
>>>> If the system workload is light, most CPUs could be at a certain level
>>>> of C-state, and the vCPU thread may waste milliseconds before it can
>>>> actually migrate to a new CPU.
>>>
>>> Well fudge.  That's definitely not on my bingo sheet.
>>>
>>>> Thus, the tests may be inefficient in such systems, and in some cases
>>>> it may fail the migration/KVM_RUN ratio sanity check.
>>>>
>>>> Since we are not able to turn off the cpuidle sub-system in run time,
>>>> this patch creates an idle thread on every CPU to prevent them from
>>>> entering C-states.
>>>
>>> First off, huge thanks for debugging this!  That must have been quite the task
>>> (no pun intended).
>>>
>>> While spinning up threads on every CPU is a clever way to ensure they don't go
>>> into a deep sleep state, I'm not exactly excited about the idea of putting every
>>> reachable CPU into a busy loop.  And while this doesn't add _that_ much complexity,
>>> I'm not sure the benefit (preserving the assert for all systems) is worth it.  I
>>> also don't want to arbitrarily prevent idle task (as in, the kernel's idle task)
>>> interactions.  E.g. it's highly (highly) unlikely, but not impossible for there
>>> to be a bug that's unique to idle tasks, or C-states, or other edge case.
>>>
>>> Are there any metrics/stats that can be (easily) checked to grant an exception
>>> to the sanity check?  That's a very hand-wavy question, as I'm not even sure what
>>> type of stat we'd want to look at.  Actual runtime of a task, maybe?
>>>
>>> If that's not easy, what if we add an off-by-default command line option to skip
>>> the sanity check?  I was resistant to simply deleting the assert in the past, but
>>> that was mainly because I didn't want to delete it without understanding what was
>>> causing problems.  That would allow CI environments to opt-out as needed, while
>>> still keeping the sanity check alive for enough systems to make it useful.
>>
>> Sorry for not replying earlier. I overlooked your email from my inbox. :)
>>
>> Alternative to the busy loop, how about using the /dev/cpu_dma_latency
>> interface to disable c-states (I wish I had learned this before writing
>> the initial patch)? The good thing is it can do automatic cleanup when
>> it closes the fd.
> 
> It's probably not practical to touch /dev/cpu_dma_latency in code, e.g. on my
> system it's fully root-only.  And forcing rseq_test to run as root, or be bookended
> with script commands to toggle /dev/cpu_dma_latency, is not a net positive.
> Lastly, fiddling with a system-wide knob in a KVM selftests is opening a can of
> worms I don't want to open.
> 
> However, we could have the failing TEST_ASSERT() explicitly call out
> /dev/cpu_dma_latency as a knob to try changing if the assert is failing.  If we
> do that *and* add a command line option to skip the sanity check, that seems like
> it would give users sufficient flexibility to avoid false positives, while still
> maintaining good coverage.

Make sense, will do it in V2.
diff mbox series

Patch

diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c
index 28f97fb52044..d6e8b851d29e 100644
--- a/tools/testing/selftests/kvm/rseq_test.c
+++ b/tools/testing/selftests/kvm/rseq_test.c
@@ -11,6 +11,7 @@ 
 #include <syscall.h>
 #include <sys/ioctl.h>
 #include <sys/sysinfo.h>
+#include <sys/resource.h>
 #include <asm/barrier.h>
 #include <linux/atomic.h>
 #include <linux/rseq.h>
@@ -29,9 +30,10 @@ 
 #define NR_TASK_MIGRATIONS 100000
 
 static pthread_t migration_thread;
+static pthread_t *idle_threads;
 static cpu_set_t possible_mask;
-static int min_cpu, max_cpu;
-static bool done;
+static int min_cpu, max_cpu, nproc;
+static volatile bool done;
 
 static atomic_t seq_cnt;
 
@@ -150,7 +152,7 @@  static void *migration_worker(void *__rseq_tid)
 		 * Use usleep() for simplicity and to avoid unnecessary kernel
 		 * dependencies.
 		 */
-		usleep((i % 10) + 1);
+		usleep((rand() % 10) + 1);
 	}
 	done = true;
 	return NULL;
@@ -158,7 +160,7 @@  static void *migration_worker(void *__rseq_tid)
 
 static void calc_min_max_cpu(void)
 {
-	int i, cnt, nproc;
+	int i, cnt;
 
 	TEST_REQUIRE(CPU_COUNT(&possible_mask) >= 2);
 
@@ -186,6 +188,61 @@  static void calc_min_max_cpu(void)
 		       "Only one usable CPU, task migration not possible");
 }
 
+static void *idle_thread_fn(void *__idle_cpu)
+{
+	int r, cpu = (int)(unsigned long)__idle_cpu;
+	cpu_set_t allowed_mask;
+
+	CPU_ZERO(&allowed_mask);
+	CPU_SET(cpu, &allowed_mask);
+
+	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
+	TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
+		errno, strerror(errno));
+
+	/* lowest priority, trying to prevent it from entering C-states */
+	r = setpriority(PRIO_PROCESS, 0, 19);
+	TEST_ASSERT(!r, "setpriority failed, errno = %d (%s)",
+		errno, strerror(errno));
+
+	while(!done);
+
+	return NULL;
+}
+
+static void spawn_threads(void)
+{
+	int cpu;
+
+	/* Run a dummy thread on every CPU */
+	for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		pthread_create(&idle_threads[cpu], NULL, idle_thread_fn,
+			       (void *)(unsigned long)cpu);
+	}
+
+	pthread_create(&migration_thread, NULL, migration_worker,
+		       (void *)(unsigned long)syscall(SYS_gettid));
+}
+
+static void join_threads(void)
+{
+	int cpu;
+
+	pthread_join(migration_thread, NULL);
+
+	for (cpu = min_cpu; cpu <= max_cpu; cpu++) {
+		if (!CPU_ISSET(cpu, &possible_mask))
+			continue;
+
+		pthread_join(idle_threads[cpu], NULL);
+	}
+
+	free(idle_threads);
+}
+
 int main(int argc, char *argv[])
 {
 	int r, i, snapshot;
@@ -199,6 +256,12 @@  int main(int argc, char *argv[])
 
 	calc_min_max_cpu();
 
+	srand(time(NULL));
+
+	idle_threads = malloc(sizeof(pthread_t) * nproc);
+	TEST_ASSERT(idle_threads, "malloc failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
 	r = rseq_register_current_thread();
 	TEST_ASSERT(!r, "rseq_register_current_thread failed, errno = %d (%s)",
 		    errno, strerror(errno));
@@ -210,8 +273,7 @@  int main(int argc, char *argv[])
 	 */
 	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
 
-	pthread_create(&migration_thread, NULL, migration_worker,
-		       (void *)(unsigned long)syscall(SYS_gettid));
+	spawn_threads();
 
 	for (i = 0; !done; i++) {
 		vcpu_run(vcpu);
@@ -258,7 +320,7 @@  int main(int argc, char *argv[])
 	TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
 		    "Only performed %d KVM_RUNs, task stalled too much?", i);
 
-	pthread_join(migration_thread, NULL);
+	join_threads();
 
 	kvm_vm_free(vm);