[v2,04/17] x86/cpu/intel: Fix the movsl alignment preference for extended Families

Message ID 20250211194407.2577252-5-sohil.mehta@intel.com (mailing list archive)
State New
Series Prepare for new Intel Family numbers

Commit Message

Sohil Mehta Feb. 11, 2025, 7:43 p.m. UTC
The alignment preference for 32-bit movsl-based bulk memory moves has
been 8 bytes for a long time. However, this preference is only set for
Family 6 and 15 processors.

Extend the preference to upcoming Family numbers 18 and 19 to maintain
legacy behavior. Also, use a VFM-based check instead of switching on
Family numbers. Refresh the comment to reflect the new check.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>

---

v2: Split the patch into two parts. Update commit message.

---
 arch/x86/kernel/cpu/intel.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

Comments

Dave Hansen Feb. 11, 2025, 8:26 p.m. UTC | #1
We should really rename intel_workarounds() to make it more clear that
it's 32-bit only. But I digress...

On 2/11/25 11:43, Sohil Mehta wrote:
> The alignment preference for 32-bit movsl-based bulk memory moves has
> been 8 bytes for a long time. However, this preference is only set for
> Family 6 and 15 processors.
> 
> Extend the preference to upcoming Family numbers 18 and 19 to maintain
> legacy behavior. Also, use a VFM-based check instead of switching on
> Family numbers. Refresh the comment to reflect the new check.
"Legacy behavior" is not important here. If anyone is running 32-bit
kernel binaries on their brand new CPUs they (as far as I know) have a
few screws loose. They don't care about performance or security and we
shouldn't care _for_ them.

If the code yielded the "wrong" movsl_mask.mask for 18/19, it wouldn't
matter one bit.

The thing that _does_ matter is someone auditing to figure out whether
the code comprehends families>15 or whether it would break in horrible
ways. The new check is shorter and it's more obvious that it will work
forever.
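
If I have the VFM packing right (vendor above family above model), the
arithmetic behind "works forever" is simple. The sketch macro and the
exact bit positions below are my assumption; the real definitions live
in cpu_device_id.h:

/* Assumed layout: vendor in bits 23:16, family in 15:8, model in 7:0 */
#define SKETCH_VFM(vendor, family, model) \
	(((vendor) << 16) | ((family) << 8) | (model))

/*
 * X86_VENDOR_INTEL is 0, so only family/model remain for Intel parts:
 *   SKETCH_VFM(0,  6, 0x01) == 0x0601   (Pentium Pro, family 6 model 1)
 *   SKETCH_VFM(0, 18, 0x00) == 0x1200   (any Family 18 part)
 * Family is more significant than model, so every family >= 6 part
 * satisfies 'c->x86_vfm >= INTEL_PENTIUM_PRO', including 18 and 19.
 */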

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
David Laight Feb. 11, 2025, 9:45 p.m. UTC | #2
On Tue, 11 Feb 2025 12:26:48 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> We should really rename intel_workarounds() to make it more clear that
> it's 32-bit only. But I digress...
> 
> On 2/11/25 11:43, Sohil Mehta wrote:
> > The alignment preference for 32-bit movsl-based bulk memory moves has
> > been 8 bytes for a long time. However, this preference is only set for
> > Family 6 and 15 processors.
> > 
> > Extend the preference to upcoming Family numbers 18 and 19 to maintain
> > legacy behavior. Also, use a VFM-based check instead of switching on
> > Family numbers. Refresh the comment to reflect the new check.
> "Legacy behavior" is not important here. If anyone is running 32-bit
> kernel binaries on their brand new CPUs they (as far as I know) have a
> few screws loose. They don't care about performance or security and we
> shouldn't care _for_ them.
> 
> If the code yielded the "wrong" movsl_mask.mask for 18/19, it wouldn't
> matter one bit.
> 
> The thing that _does_ matter is someone auditing to figure out whether
> the code comprehends families>15 or whether it would break in horrible
> ways. The new check is shorter and it's more obvious that it will work
> forever.

For any Intel non-Atom processor since Ivy Bridge, the only alignment
that makes a real difference is aligning the destination to a 32-byte
boundary. That does make it twice as fast (32 bytes/clock rather than 16).
The source alignment never matters.
(I've got access to one of the later 64-bit 8-core Atoms - but can't
remember how it behaves.)

For short (IIRC 1..32 byte) transfers the cost is constant.
The cost depends on the CPU; Ivy Bridge is something like 40 clocks,
lower for later CPUs.
(Unlike the P4, where the overhead is some 163 clocks.)
It also makes no difference whether you do 'rep movsb' or 'rep movsq'.

For any of those CPUs I'm not sure it is ever worth using anything
other than 'rep movsb' unless the length is known to be very short,
likely a multiple of 4/8 and preferably constant.
Doing a function call and one or two mispredictable branches will
soon eat into the overhead, not to mention displacing code from the I-cache.
Unless you are micro-optimising a very hot path it really isn't worth
doing anything else.
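
For reference, the whole "just use rep movsb" helper is tiny. An
untested sketch (the function name is mine, not anything from the tree):

static inline void *rep_movsb_copy(void *dst, const void *src,
				   unsigned long len)
{
	void *ret = dst;

	/* Let the microcode pick the chunking; no alignment games. */
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     :
		     : "memory");
	return ret;
}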

OTOH even some recent AMD CPUs are reported not to have FSRM and will
execute 'rep movsb' slowly.

I only 'discovered' that code at the weekend; just the memory load to
get the mask is going to slow things down.
Running a benchmark test it'll be in cache and the branch predictor
will remember what you are doing.
Come in with a 'cold cache' and (IIRC) Intel CPUs have a 50% chance of
predicting a branch taken (no static prediction - e.g. backwards taken).
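
The check I'm talking about is roughly this shape - a simplified sketch
from memory, not the exact code in arch/x86/lib/usercopy_32.c, and the
function name and 64-byte cut-off are purely illustrative:

static int movsl_alignment_ok(const void *to, const void *from,
			      unsigned long n)
{
	/* movsl_mask.mask is 7 when 8-byte alignment is preferred */
	if (n < 64)		/* short copies: not worth caring */
		return 1;

	/* This mask load plus the branch on it is the cost I mean. */
	return !(((unsigned long)to | (unsigned long)from) & movsl_mask.mask);
}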

Even for normal memory accesses I've not seen any significant slowdown
from misalignment.
Accesses that cross a cache line might end up being 2 uops, but the CPU
can do two reads/clock (with a following wind) and it is hard to write
a code loop that gets close to sustaining that.

I have tested the IP checksum (adc loop) code with misaligned buffers
and don't remember a significant slowdown even for the version that does
three memory reads every two clocks (which seems to be the limit).

I actually suspect that any copies which matter are aligned, so the cost
of the check far outweighs the benefit across all the calls.

One optimisation that seems to be absent: if you are doing a
register copy loop, any trailing bytes can be copied by doing a
misaligned copy of the last word (and I mean word, not 16 bits)
of the buffer, copying a few bytes twice - roughly the sketch below.
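
An untested, userspace-flavoured sketch (the function name is mine),
assuming the total length is at least one 8-byte word:

#include <string.h>

static void copy_with_overlapping_tail(void *dst, const void *src,
				       unsigned long len)
{
	unsigned long i;

	/* Bulk copy in whole 8-byte words. */
	for (i = 0; i + 8 <= len; i += 8)
		memcpy((char *)dst + i, (const char *)src + i, 8);

	/*
	 * Mop up the 1..7 trailing bytes by re-copying the last,
	 * possibly misaligned, word of the buffer.  A few bytes near
	 * the end get written twice, which is harmless.
	 */
	if (i < len)
		memcpy((char *)dst + len - 8, (const char *)src + len - 8, 8);
}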

	David

Patch

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 3dce22f00dc3..e5f34a90963e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -449,23 +449,16 @@  static void intel_workarounds(struct cpuinfo_x86 *c)
 	    (c->x86_stepping < 0x6 || c->x86_stepping == 0xb))
 		set_cpu_bug(c, X86_BUG_11AP);
 
-
 #ifdef CONFIG_X86_INTEL_USERCOPY
 	/*
-	 * Set up the preferred alignment for movsl bulk memory moves
+	 * movsl bulk memory moves can be slow when source and dest are not
+	 * both 8-byte aligned. PII/PIII only like movsl with 8-byte alignment.
+	 *
+	 * Set the preferred alignment for Pentium Pro and newer processors, as
+	 * it has only been tested on these.
 	 */
-	switch (c->x86) {
-	case 4:		/* 486: untested */
-		break;
-	case 5:		/* Old Pentia: untested */
-		break;
-	case 6:		/* PII/PIII only like movsl with 8-byte alignment */
+	if (c->x86_vfm >= INTEL_PENTIUM_PRO)
 		movsl_mask.mask = 7;
-		break;
-	case 15:	/* P4 is OK down to 8-byte alignment */
-		movsl_mask.mask = 7;
-		break;
-	}
 #endif
 
 	intel_smp_check(c);