Message ID: 20240507235344.249103-2-chang.seok.bae@intel.com
State: Changes Requested, archived
Series: x86/fpu: Extend kernel_fpu_begin_mask() for the In-Field Scan driver
On Tue, May 07 2024 at 16:53, Chang S. Bae wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 1209c7aebb21..04cc6f14ca42 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -440,6 +440,9 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>
>  	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
>  		asm volatile ("fninit");
> +
> +	if (unlikely(kfpu_mask & KFPU_AMX) && boot_cpu_has(X86_FEATURE_AMX_TILE))

cpu_feature_enabled() please

> +		restore_fpregs_from_fpstate(&init_fpstate, XFEATURE_MASK_XTILE);
>  }
>  EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
The subject should probably be:

	Allow kernel FPU users to initialize AMX state

On 5/7/24 16:53, Chang S. Bae wrote:
> The In-Field Scan (IFS) test [1] is a destructive process, overwriting
> the existing state to test the logic on the fly. As part of this test
> process, the architectural state is saved before the test begins and
> then restored upon completion.

This should say "_most_ architectural state".

> However, due to resource constraints in storage, AMX state is excluded
> from the scope of state recovery. Consequently, AMX state must be in its
> initialized state for the IFS test to run.

This doesn't mention how this issue got introduced. Are we all bad at
reading the SDM? :)

> When AMX workloads are running, an active user AMX state remains even
> after a context switch, optimizing to reduce the state reload cost. In
> such cases, the test cannot proceed if it is scheduled.

This is a bit out of the blue. What does scheduling have to do with IFS?

> System administrators may attempt to mitigate this issue, by arranging
> AMX workloads not to run on CPUs selected for the tests. However, this
> approach is disruptive for managing large-scaled systems, diminishing the
> benefit of the live testing.

I personally prefer to go and mention alternative solutions *after* I've
discussed the things that are truly relevant to the problem and fix. I
think this distracts from the real issue.

> The kernel can help by properly initializing the state before the test.
> This initialization impacts the performance to some degree. But, this
> approach is considerably cheaper than adding hardware resources and
> simpler than a userspace approach.
>
> While fpu_idle_fpregs() can initialize the AMX state, its usage should be
> limited to specialized cases, primarily before entering the sleep state.
> The restore_fpregs_from_fpstate() function offers a suitable mechanism
> for initializing fpstate in general, which remains within the core code.

I'm not sure those last two paragraphs add much value. I'd try to
banish most of that content to *after* you talk about the solution. Or
maybe put it in the cover letter.

> Extend kernel_fpu_begin_mask() to allow the IFS driver to initialize AMX
> state through restore_fpregs_from_fpstate().

The function names are pretty blatantly obvious from the patch. No need
to copy them here.

As I look closer, I'm not sure I think this is a good match for the two
other KFPU_* flags. I don't think either KFPU_387 or KFPU_MXCSR causes
any XFEATURE to be tracked as init. The SDM says that FNINIT "sets the
FPU control, status, tag, instruction pointer, and data pointer
registers to their default states." Off the top of my head, I don't know
how that maps to the XSAVE init tracking, but I do know that MXCSR has a
very weird mapping to the first two XSAVE features.

I really think it would be simplest to just have this whole thing do this:

	kernel_fpu_begin();

	// Zap AMX state

	kernel_fpu_end();

Where the zap is either an os_xrstor() or an explicit tile release
instruction.

It's just a matter of whether you want this to work like a regular old
kernel FPU user or you want to tie it to *only* being able to run in its
own kernel thread.

--

Aside: I don't think I realized IFS had its own thread earlier.
On 5/8/2024 7:40 AM, Dave Hansen wrote:
> On 5/7/24 16:53, Chang S. Bae wrote:
>
>> However, due to resource constraints in storage, AMX state is excluded
>> from the scope of state recovery. Consequently, AMX state must be in its
>> initialized state for the IFS test to run.
>
> This doesn't mention how this issue got introduced. Are we all bad at
> reading the SDM? :)

Ah, I'd rather zap out this SDM sentence.

>> When AMX workloads are running, an active user AMX state remains even
>> after a context switch, optimizing to reduce the state reload cost. In
>> such cases, the test cannot proceed if it is scheduled.
>
> This is a bit out of the blue. What does scheduling have to do with IFS?

	$ echo <cpu#> > /sys/devices/virtual/misc/intel_ifs_0/run_test

Then,

	run_test_store() -> do_core_test() -> ifs_test_core()
	  -> stop_core_cpuslocked() -> stop_cpus()
	  -> queue_stop_cpus_work() -> cpu_stop_queue_work()
	  -> wake_q_add() -> wake_up_q()

So, the CPU stopper threads for <cpu#> and its sibling to execute
doscan() are queued up with the highest priority.

queue_stop_cpus_work() has

	for_each_cpu(cpu, cpumask) {
		work = &per_cpu(cpu_stopper.stop_work, cpu);
		work->fn = fn;
		work->arg = arg;
		work->done = done;
		work->caller = _RET_IP_;
		if (cpu_stop_queue_work(cpu, work))
			queued = true;
	}

Those threads are created during early boot via
smpboot_register_percpu_thread().

> I'm not sure those last two paragraphs add much value. I'd try to
> banish most of that content to *after* you talk about the solution. Or
> maybe put it in the cover letter.

It looks like lots of distractions coming from a bunch of alternatives
at different levels.

Thanks,
Chang

PS: Let me respond to the solution discussion separately. I do want to
experiment with the init-track behavior a bit.
On 5/8/24 11:03, Chang S. Bae wrote:
> On 5/8/2024 7:40 AM, Dave Hansen wrote:
>> On 5/7/24 16:53, Chang S. Bae wrote:
>>
>>> However, due to resource constraints in storage, AMX state is excluded
>>> from the scope of state recovery. Consequently, AMX state must be in its
>>> initialized state for the IFS test to run.
>>
>> This doesn't mention how this issue got introduced. Are we all bad at
>> reading the SDM? :)
>
> Ah, I'd rather zap out this SDM sentence.

My point is that this is fixing a bug. Where did that bug come from?
What got screwed up here?

Hint: I don't think us software folks screwed up here. It was likely
that the folks who built the two hardware features (AMX and IFS) forgot
to talk to each other, or someone forgot to document the AMX clobbering
aspect of the architecture.

>>> When AMX workloads are running, an active user AMX state remains even
>>> after a context switch, optimizing to reduce the state reload cost. In
>>> such cases, the test cannot proceed if it is scheduled.
>>
>> This is a bit out of the blue. What does scheduling have to do with IFS?
...
> So, the CPU stopper threads for <cpu#> and its sibling to execute
> doscan() are queued up with the highest priority.
...

But this is the IFS implementation *today*. The explanation depends on
IFS being implemented with something that context switches. It also
depends on folks expecting context switches to always switch FPU state.

I'd just say:

	The kernel generally runs with live user FPU state, including
	AMX. That state can prevent IFS tests from running.

That's _much_ more simple, generic and also fully explains the
situation. It also isn't dependent on the IFS stop_cpus_run()
implementation of today, which could totally change tomorrow. The
underlying rule has zero to do with scheduling or context switching
optimizations.
On 5/8/2024 7:40 AM, Dave Hansen wrote:
>
> As I look closer, I'm not sure I think this is a good match for the two
> other KFPU_* flags. I don't think either KFPU_387 or KFPU_MXCSR causes
> any XFEATURE to be tracked as init. The SDM says that FNINIT "sets the
> FPU control, status, tag, instruction pointer, and data pointer
> registers to their default states."
>
> Off the top of my head, I don't know how that maps to the XSAVE init
> tracking but I do know that MXCSR has a very weird mapping to the first
> two XSAVE features.

FNINIT does *not* set the component to be tracked as init. Indeed, the
SSE component is somewhat convoluted. AMX state will likely be tracked
as init. But, I believe this init-tracking mechanism is
implementation-specific, not architectural.

Beyond this intricacy of init tracking, each KFPU_* flag initializes its
corresponding state:

	/* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
	#define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
	#define KFPU_MXCSR	_BITUL(1)	/* MXCSR will be initialized */

> I really think it would be simplest to just have this whole thing do this:
>
>	kernel_fpu_begin();
>
>	// Zap AMX state
>
>	kernel_fpu_end();
>
> Where the zap is either an os_xrstor() or an explicit tile release
> instruction.
If I understand correctly, the change could be something like this,
although I'm not sure about this new API naming and prototype at this
point:

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index a2be3aefff9f..906634402ea6 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -28,6 +28,7 @@
 
 extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
+extern void kernel_fpu_reset(void);
 extern bool irq_fpu_usable(void);
 extern void fpregs_mark_activate(void);
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb21..0351f9ee3e49 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -452,6 +452,15 @@ void kernel_fpu_end(void)
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
+void kernel_fpu_reset(void)
+{
+	kernel_fpu_begin();
+	if (cpu_feature_enabled(X86_FEATURE_AMX_TILE))
+		tile_release();
+	kernel_fpu_end();
+}
+EXPORT_SYMBOL(kernel_fpu_reset);
+
 /*
  * Sync the FPU register state to current's memory register state when the
  * current task owns the FPU. The hardware register state is preserved.
diff --git a/drivers/platform/x86/intel/ifs/ifs.h b/drivers/platform/x86/intel/ifs/ifs.h
index 56b9f3e3cf76..5e7ba94b4054 100644
--- a/drivers/platform/x86/intel/ifs/ifs.h
+++ b/drivers/platform/x86/intel/ifs/ifs.h
@@ -129,6 +129,7 @@
  */
 #include <linux/device.h>
 #include <linux/miscdevice.h>
+#include <asm/fpu/api.h>
 
 #define MSR_ARRAY_BIST		0x00000105
 #define MSR_COPY_SCAN_HASHES	0x000002c2
diff --git a/drivers/platform/x86/intel/ifs/runtest.c b/drivers/platform/x86/intel/ifs/runtest.c
index 95b4b71fab53..33ef4962aeba 100644
--- a/drivers/platform/x86/intel/ifs/runtest.c
+++ b/drivers/platform/x86/intel/ifs/runtest.c
@@ -188,6 +188,8 @@ static int doscan(void *data)
 	/* Only the first logical CPU on a core reports result */
 	first = cpumask_first(cpu_smt_mask(cpu));
 
+	kernel_fpu_reset();
+
 	wait_for_sibling_cpu(&scan_cpus_in, NSEC_PER_SEC);
 
 	/*

> It's just a matter of whether you want this to work like a regular old
> kernel FPU user or you want to tie it to *only* being able to run in its
> own kernel thread.
>
> --
>
> Aside: I don't think I realized IFS had its own thread earlier.

While both approaches aim for the same functionality, the code change
seems to be smaller for the above. At the same time, it is notable that
these new APIs, fpu_idle_fpregs() and the other, are *only* for AMX.

1. The diff stat of this series' approach:

	arch/x86/include/asm/fpu/api.h           |  1 +
	arch/x86/kernel/fpu/core.c               |  3 +++
	drivers/platform/x86/intel/ifs/ifs.h     | 15 +++++++++++++++
	drivers/platform/x86/intel/ifs/runtest.c |  6 ++++++
	4 files changed, 25 insertions(+)

2. The above code changes:

	arch/x86/include/asm/fpu/api.h           | 1 +
	arch/x86/kernel/fpu/core.c               | 9 +++++++++
	drivers/platform/x86/intel/ifs/ifs.h     | 1 +
	drivers/platform/x86/intel/ifs/runtest.c | 2 ++
	4 files changed, 13 insertions(+)

Thanks,
Chang
On 5/8/24 17:29, Chang S. Bae wrote:
> +void kernel_fpu_reset(void)
> +{
> +	kernel_fpu_begin();
> +	if (cpu_feature_enabled(X86_FEATURE_AMX_TILE))
> +		tile_release();
> +	kernel_fpu_end();
> +}
> +EXPORT_SYMBOL(kernel_fpu_reset);
> +
...
> --- a/drivers/platform/x86/intel/ifs/runtest.c
> +++ b/drivers/platform/x86/intel/ifs/runtest.c
> @@ -188,6 +188,8 @@ static int doscan(void *data)
> 	/* Only the first logical CPU on a core reports result */
> 	first = cpumask_first(cpu_smt_mask(cpu));
>
> +	kernel_fpu_reset();
> +
> 	wait_for_sibling_cpu(&scan_cpus_in, NSEC_PER_SEC);

Remember, kernel_fpu_begin/end() mark a section of code that needs the
FPU. Once code calls kernel_fpu_end(), it no longer owns the FPU and
all bets are off. An interrupt could theoretically come in and do
whatever it wants. I _assume_ that this is practically impossible since
the stop_machine() infrastructure keeps interrupts at bay. But it's
rather subtle.

I'd probably just do this:

+	kernel_fpu_begin();
+
+	// AMX *MUST* be in the init state for the wrmsr() to work.
+	// But, the more in the init state, the less state the test
+	// has to save and restore. Just zap everything.
+	restore_fpregs_from_fpstate(&init_fpstate,
+				    fpu_user_cfg.max_features);
+
 	wrmsrl(MSR_ACTIVATE_SCAN, params->activate->data);
 	rdmsrl(MSR_SCAN_STATUS, status.data);
 
+	kernel_fpu_end();

That's dirt simple. It doesn't require new infrastructure. It doesn't
call an opaque new helper. It doesn't require a feature check. It
probably makes the IFS test run faster. It will also magically work for
any fancy new feature that comes along which *ALSO* needs to be in its
init state ... with zero changes to this code.

For bonus points, this code is quite universal. It will work, as-is, in
a bunch of kernel contexts if a future deranged kernel developer copies
and pastes it. The code you suggested above can race unless it's called
under stop_machine() and isn't safe to copy elsewhere.

Three lines of code:

1. IFS declares its need to own the FPU for a moment, like any other
   kernel_fpu_begin() user. It's not a special snowflake. It is boring.
2. IFS zaps the FPU state
3. IFS gives up the FPU

Am I out of my mind? What am I missing? Why bother with _anything_ more
complicated than this?
On 5/9/24 10:36, Dave Hansen wrote:
> +	restore_fpregs_from_fpstate(&init_fpstate,
> +				    fpu_user_cfg.max_features);

There is *one* subtlety here. This assumes that 'init_fpstate' has AMX
marked as being in its init state. But I think that's a pretty safe
assumption.
On 5/9/2024 10:36 AM, Dave Hansen wrote:
>
> I'd probably just do this:
>
> +	kernel_fpu_begin();
>
> +	// AMX *MUST* be in the init state for the wrmsr() to work.
> +	// But, the more in the init state, the less state the test
> +	// has to save and restore. Just zap everything.
> +	restore_fpregs_from_fpstate(&init_fpstate,
> +				    fpu_user_cfg.max_features);
> +

I assume that this snippet goes to the IFS driver side. Then, we need to
introduce and export a new wrapper for this.
restore_fpregs_from_fpstate() and its arguments are not accessible as of
now. Also, I think we should encapsulate them.

If we follow this style, we could have invoked tilerelease() directly
from the idle driver, right?

> 	wrmsrl(MSR_ACTIVATE_SCAN, params->activate->data);
> 	rdmsrl(MSR_SCAN_STATUS, status.data);
>
> +	kernel_fpu_end();
On 5/9/24 11:41, Chang S. Bae wrote:
> On 5/9/2024 10:36 AM, Dave Hansen wrote:
>>
>> I'd probably just do this:
>>
>> +	kernel_fpu_begin();
>> +	// AMX *MUST* be in the init state for the wrmsr() to work.
>> +	// But, the more in the init state, the less state the test
>> +	// has to save and restore. Just zap everything.
>> +	restore_fpregs_from_fpstate(&init_fpstate,
>> +				    fpu_user_cfg.max_features);
>> +
>
> I assume that this snippet goes to the IFS driver side. Then, we need
> to introduce and export a new wrapper for this.
> restore_fpregs_from_fpstate() and its arguments are not accessible as
> of now.

Yes, a new wrapper to initialize all the user FPU state is fine.

> Also, I think we should encapsulate them. If we follow this style, we
> could have invoked tilerelease() directly from the idle driver, right?

You could have... But I think the point there truly was to do a minimal
amount of work because you're going idle and need to race to get there.
The path to idle is super hot and you don't want to do _any_ additional
work. It's worth writing highly AMX-specific code because generic code
would be slow there.

The IFS path is a super duper slow one. It takes *FOREVER*. I like the
idea of being simple, dumb and slow.
On Thu, May 09 2024 at 10:36, Dave Hansen wrote:
> Three lines of code:
>
> 1. IFS declares its need to own the FPU for a moment, like any
>    other kernel_fpu_begin() user. It's not a special snowflake.
>    It is boring.
> 2. IFS zaps the FPU state
> 3. IFS gives up the FPU
>
> Am I out of my mind? What am I missing? Why bother with _anything_
> more complicated than this?

Right. Keep it simple :)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index a2be3aefff9f..67887fc45c24 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -25,6 +25,7 @@
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
 #define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
 #define KFPU_MXCSR	_BITUL(1)	/* MXCSR will be initialized */
+#define KFPU_AMX	_BITUL(2)	/* AMX will be initialized */
 
 extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb21..04cc6f14ca42 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -440,6 +440,9 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
 	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
 		asm volatile ("fninit");
+
+	if (unlikely(kfpu_mask & KFPU_AMX) && boot_cpu_has(X86_FEATURE_AMX_TILE))
+		restore_fpregs_from_fpstate(&init_fpstate, XFEATURE_MASK_XTILE);
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
The In-Field Scan (IFS) test [1] is a destructive process, overwriting
the existing state to test the logic on the fly. As part of this test
process, the architectural state is saved before the test begins and
then restored upon completion.

However, due to resource constraints in storage, AMX state is excluded
from the scope of state recovery. Consequently, AMX state must be in its
initialized state for the IFS test to run.

When AMX workloads are running, an active user AMX state remains even
after a context switch, optimizing to reduce the state reload cost. In
such cases, the test cannot proceed if it is scheduled.

System administrators may attempt to mitigate this issue by arranging
AMX workloads not to run on CPUs selected for the tests. However, this
approach is disruptive for managing large-scale systems, diminishing the
benefit of the live testing.

The kernel can help by properly initializing the state before the test.
This initialization impacts the performance to some degree. But this
approach is considerably cheaper than adding hardware resources and
simpler than a userspace approach.

While fpu_idle_fpregs() can initialize the AMX state, its usage should
be limited to specialized cases, primarily before entering the sleep
state. The restore_fpregs_from_fpstate() function offers a suitable
mechanism for initializing fpstate in general, which remains within the
core code.

Extend kernel_fpu_begin_mask() to allow the IFS driver to initialize AMX
state through restore_fpregs_from_fpstate().

[1]: https://docs.kernel.org/arch/x86/ifs.html

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1 -> V2: Revise the changelog (Dave Hansen and Ashok Raj).

The recently published IFS documentation [2] elaborates its purpose and
the requirements of the context restoration after the scan test.
Additionally, the necessity of AMX initialization is emphasized in the
Intel Software Developer's Manual as of March 2024, in Section 18.2 of
Vol. 1.
Side note: restore_fpregs_from_fpstate() also sets the x87 state to a
fixed value. However, this only applies to AMD CPUs with the
FXSAVE_LEAK quirk.

[2] IFS Technical Paper: Finding Faulty Components in a Live Fleet Environment
    https://www.intel.com/content/www/us/en/content-details/822279/finding-faulty-components-in-a-live-fleet-environment.html
---
 arch/x86/include/asm/fpu/api.h | 1 +
 arch/x86/kernel/fpu/core.c     | 3 +++
 2 files changed, 4 insertions(+)