
[RFC,v2] parisc: Add alternative coding when running UP

Message ID 20181014183424.GA20783@ls3530.fritz.box (mailing list archive)
State Superseded

Commit Message

Helge Deller Oct. 14, 2018, 6:34 p.m. UTC
This patch adds the necessary code to patch a running SMP kernel
at runtime to improve performance when running on a single CPU.

The current implementation offers two patching variants:
- Unwanted assembler statements, such as the locking code, are overwritten
  with NOPs. When multiple instructions are to be skipped, a single branch
  instruction is used instead of multiple NOP instructions.

- Some pdtlb and pitlb instructions are patched to become pdtlb,l and
  pitlb,l, which flush only the CPU-local TLB entries instead of
  broadcasting the flush to other CPUs in the system, and thus may
  improve performance.
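
As a rough illustration of the two variants, the choice of the word that gets
written over a patch site can be sketched in plain C as below. The constants
and the branch formula are taken from the patch itself; the helper name
pick_replacement(), the PXTLB_LOCAL_BIT macro and the is_pa20 flag are
made up for this sketch and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define INSN_PXTLB      0x02u        /* marker: turn pdtlb/pitlb into the ",l" form */
#define INSN_NOP        0x08000240u  /* nop encoding used by the patch */
#define PXTLB_LOCAL_BIT (1u << 10)   /* hypothetical name for the bit the patch sets */

/*
 * Choose the word stored over the first original instruction.
 *   marker  - replacement field from the table entry
 *   orig    - instruction currently at the patch site
 *   len     - number of instructions covered by the entry
 *   is_pa20 - non-zero if the CPU is PA 2.0 or newer (assumption of this sketch)
 */
static uint32_t pick_replacement(uint32_t marker, uint32_t orig,
                                 uint32_t len, int is_pa20)
{
    assert(len >= 1);

    /* TLB purge: keep the original instruction, only set the local bit. */
    if (marker == INSN_PXTLB)
        return is_pa20 ? (orig | PXTLB_LOCAL_BIT) : orig;

    /* More than one instruction to skip: a single branch over the span. */
    if (marker == INSN_NOP && len > 1)
        return 0xe8000002u + (len - 2) * 8;

    /* Single instruction: store the replacement word as-is. */
    return marker;
}
```

Only one word is ever written per entry, which is why a multi-instruction
span needs the branch rather than a run of NOPs.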

Live-patching is done early in the boot process, just after having run
the system inventory. No drivers are running and thus no external
interrupts should arrive. So the hope is that no TLB exceptions will
occur during the patching. If this turns out to be wrong we will
probably need to do the patching in real-mode.
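
For readers unfamiliar with this kind of table: each entry records the
distance from itself to the instruction it patches, not an absolute address.
A minimal user-space sketch of that lookup follows. The struct layout mirrors
struct alt_instr from the patch, but the names alt_entry, alt_target(),
alt_record() and alt_apply() are hypothetical, and the sketch ignores the
text remapping and cache flushing that the kernel needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct alt_entry {
    int32_t  orig_offset;  /* bytes from this field to the patch site */
    uint32_t len;          /* number of instructions covered */
    uint32_t replacement;  /* word to store at the patch site */
};

/* Resolve the patch site: address of the offset field plus the stored offset. */
static uint32_t *alt_target(struct alt_entry *e)
{
    return (uint32_t *)((intptr_t)&e->orig_offset + e->orig_offset);
}

/* Fill an entry the way the assembler does with ".word (from - .)". */
static void alt_record(struct alt_entry *e, uint32_t *site,
                       uint32_t len, uint32_t replacement)
{
    intptr_t delta = (intptr_t)site - (intptr_t)&e->orig_offset;

    /* The offset must fit into the signed 32-bit field. */
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    e->orig_offset = (int32_t)delta;
    e->len = len;
    e->replacement = replacement;
}

/* Walk the table and store one replacement word per entry. */
static size_t alt_apply(struct alt_entry *begin, struct alt_entry *end)
{
    size_t patched = 0;

    for (struct alt_entry *e = begin; e < end; e++) {
        *alt_target(e) = e->replacement;
        patched++;
    }
    return patched;
}
```

Because the offsets are relative, the table needs no relocation when the
kernel image is loaded at a different address.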

Signed-off-by: Helge Deller <deller@gmx.de>

Comments

James Bottomley Oct. 15, 2018, 9:11 p.m. UTC | #1
On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
> This patch adds the necessary code to patch a running SMP kernel
> at runtime to improve performance when running on a single CPU.
> 
> The current implementation offers two patching variants:
> - Unwanted assembler statements like locking functions are
> overwritten
>   with NOPs. When multiple instructions shall be skipped, one branch
>   instruction is used instead of multiple nop instructions.

This seems like a good idea because our spinlocks are particularly
heavyweight.

> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>   pitlb,l which only flushes the CPU-local tlb entries instead of
>   broadcasting the flush to other CPUs in the system and thus may
>   improve performance.

I really don't think this matters: on a UP system, pdtlb,l and pdtlb
are the same instruction because the CPU already knows it has no
internal CPU bus to broadcast the purge over so it in effect executes a
pdtlb,l regardless.

James
Helge Deller Oct. 16, 2018, 5:34 a.m. UTC | #2
On 15.10.2018 23:11, James Bottomley wrote:
> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>> This patch adds the necessary code to patch a running SMP kernel
>> at runtime to improve performance when running on a single CPU.
>>
>> The current implementation offers two patching variants:
>> - Unwanted assembler statements like locking functions are
>> overwritten
>>   with NOPs. When multiple instructions shall be skipped, one branch
>>   instruction is used instead of multiple nop instructions.
> 
> This seems like a good idea because our spinlocks are particularly
> heavyweight.
> 
>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>   pitlb,l which only flushes the CPU-local tlb entries instead of
>>   broadcasting the flush to other CPUs in the system and thus may
>>   improve performance.
> 
> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
> are the same instruction because the CPU already knows it has no
> internal CPU bus to broadcast the purge over so it in effect executes a
> pdtlb,l regardless.

I'd be happy to drop this part again.
But is that also true on an SMP system that was booted with maxcpus=1?

Helge
John David Anglin Oct. 16, 2018, 12:08 p.m. UTC | #3
On 2018-10-16 1:34 AM, Helge Deller wrote:
> On 15.10.2018 23:11, James Bottomley wrote:
>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>> This patch adds the necessary code to patch a running SMP kernel
>>> at runtime to improve performance when running on a single CPU.
>>>
>>> The current implementation offers two patching variants:
>>> - Unwanted assembler statements like locking functions are
>>> overwritten
>>>    with NOPs. When multiple instructions shall be skipped, one branch
>>>    instruction is used instead of multiple nop instructions.
>> This seems like a good idea because our spinlocks are particularly
>> heavyweight.
>>
>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>>    pitlb,l which only flushes the CPU-local tlb entries instead of
>>>    broadcasting the flush to other CPUs in the system and thus may
>>>    improve performance.
>> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
>> are the same instruction because the CPU already knows it has no
>> internal CPU bus to broadcast the purge over so it in effect executes a
>> pdtlb,l regardless.
> I'd be happy to drop this part again.
> But is that true on a SMP system, where one has booted with maxcpus=1, too?
I would like to see what happens on panama.  Panama is an rp3410.
Currently, it takes approximately 4042 cycles to flush one page (4096 bytes).
This is way more than the number of cycles that I see on my rp3440.  My c3750
takes 450 cycles per page with the patch.  It could be that pdtlb,l and pdtlb
are equivalent on c3750.

Is there something wrong with SMP on panama?
Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU

I know replacing "sync and normal store" with an ordered store in spin lock
release makes a significant difference in the above timing.  I plan to send a
patch tonight.

Dave
James Bottomley Oct. 16, 2018, 2:29 p.m. UTC | #4
On Tue, 2018-10-16 at 07:34 +0200, Helge Deller wrote:
> On 15.10.2018 23:11, James Bottomley wrote:
> > On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
> > > This patch adds the necessary code to patch a running SMP kernel
> > > at runtime to improve performance when running on a single CPU.
> > > 
> > > The current implementation offers two patching variants:
> > > - Unwanted assembler statements like locking functions are
> > > overwritten
> > >   with NOPs. When multiple instructions shall be skipped, one
> > > branch
> > >   instruction is used instead of multiple nop instructions.
> > 
> > This seems like a good idea because our spinlocks are particularly
> > heavyweight.
> > 
> > > - Some pdtlb and pitlb instructions are patched to become pdtlb,l
> > > and
> > >   pitlb,l which only flushes the CPU-local tlb entries instead of
> > >   broadcasting the flush to other CPUs in the system and thus may
> > >   improve performance.
> > 
> > > I really don't think this matters: on a UP system, pdtlb,l and
> > > pdtlb are the same instruction because the CPU already knows it has
> > no internal CPU bus to broadcast the purge over so it in effect
> > executes a pdtlb,l regardless.
> 
> I'd be happy to drop this part again.
> But is that true on a SMP system, where one has booted with
> maxcpus=1, too?

I don't think so because the secondaries will all be in their active
boot loops, so the internal coherence bus will also be active.  It's
not really clear that's a common case, though ...

James
Helge Deller Oct. 16, 2018, 8:20 p.m. UTC | #5
On 16.10.2018 16:29, James Bottomley wrote:
> On Tue, 2018-10-16 at 07:34 +0200, Helge Deller wrote:
>> On 15.10.2018 23:11, James Bottomley wrote:
>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>> This patch adds the necessary code to patch a running SMP kernel
>>>> at runtime to improve performance when running on a single CPU.
>>>>
>>>> The current implementation offers two patching variants:
>>>> - Unwanted assembler statements like locking functions are
>>>> overwritten
>>>>   with NOPs. When multiple instructions shall be skipped, one
>>>> branch
>>>>   instruction is used instead of multiple nop instructions.
>>>
>>> This seems like a good idea because our spinlocks are particularly
>>> heavyweight.
>>>
>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l
>>>> and
>>>>   pitlb,l which only flushes the CPU-local tlb entries instead of
>>>>   broadcasting the flush to other CPUs in the system and thus may
>>>>   improve performance.
>>>
>>> I really don't think this matters: on a UP system, pdtlb,l and
>>> pdtlb are the same instruction because the CPU already knows it has
>>> no internal CPU bus to broadcast the purge over so it in effect
>>> executes a pdtlb,l regardless.
>>
>> I'd be happy to drop this part again.
>> But is that true on a SMP system, where one has booted with
>> maxcpus=1, too?
> 
> I don't think so because the secondaries will all be in their active
> boot loops, so the internal coherence bus will also be active.  It's
> not really clear that's a common case, though ...

OK, since it doesn't hurt to keep the pdtlb -> pdtlb,l replacement, I think
we should simply keep it. It doesn't add any overhead either.

Helge
Helge Deller Oct. 16, 2018, 8:51 p.m. UTC | #6
On 16.10.2018 14:08, John David Anglin wrote:
> On 2018-10-16 1:34 AM, Helge Deller wrote:
>> On 15.10.2018 23:11, James Bottomley wrote:
>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>> This patch adds the necessary code to patch a running SMP kernel
>>>> at runtime to improve performance when running on a single CPU.
>>>>
>>>> The current implementation offers two patching variants:
>>>> - Unwanted assembler statements like locking functions are
>>>> overwritten
>>>>    with NOPs. When multiple instructions shall be skipped, one branch
>>>>    instruction is used instead of multiple nop instructions.
>>> This seems like a good idea because our spinlocks are particularly
>>> heavyweight.
>>>
>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>>>    pitlb,l which only flushes the CPU-local tlb entries instead of
>>>>    broadcasting the flush to other CPUs in the system and thus may
>>>>    improve performance.
>>> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
>>> are the same instruction because the CPU already knows it has no
>>> internal CPU bus to broadcast the purge over so it in effect executes a
>>> pdtlb,l regardless.
>> I'd be happy to drop this part again.
>> But is that true on a SMP system, where one has booted with maxcpus=1, too?
> I would like to see what happens on panama.  Panama is a rp3410. Currently, it takes
> approximately 4042 cycles to flush one page (4096 bytes).  This is way more than the number
> of cycles that I see on my rp3440.  My c3750 takes 450 cycles per page with patch.  It could
> be pdtlb,l and pdtlb are equivalent on c3750.

It depends on what you flush.
On c3750 we may get fooled because the kernel area could have been mapped via huge pages,
while on rp34x0 the PA8900 CPU prevents huge pages for the kernel.
That may explain the performance difference between c3750 and rp3410, but not
the difference from rp3440.

> Is there something wrong with SMP on panama?
> Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
> Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU

Will check tomorrow.
 
> I know replacing "sync and normal store" with ordered store in spin lock release makes a
> significant difference in the above timing.  Plan to send patch tonight.

What exactly do you want me to test on panama?
Is the git head with my latest for-next tree [1] OK ?

Helge

[1] https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/log/?h=for-next
John David Anglin Oct. 16, 2018, 9:45 p.m. UTC | #7
On 2018-10-16 4:51 PM, Helge Deller wrote:
> On 16.10.2018 14:08, John David Anglin wrote:
>> On 2018-10-16 1:34 AM, Helge Deller wrote:
>>> On 15.10.2018 23:11, James Bottomley wrote:
>>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>>> This patch adds the necessary code to patch a running SMP kernel
>>>>> at runtime to improve performance when running on a single CPU.
>>>>>
>>>>> The current implementation offers two patching variants:
>>>>> - Unwanted assembler statements like locking functions are
>>>>> overwritten
>>>>>     with NOPs. When multiple instructions shall be skipped, one branch
>>>>>     instruction is used instead of multiple nop instructions.
>>>> This seems like a good idea because our spinlocks are particularly
>>>> heavyweight.
>>>>
>>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>>>>     pitlb,l which only flushes the CPU-local tlb entries instead of
>>>>>     broadcasting the flush to other CPUs in the system and thus may
>>>>>     improve performance.
>>>> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
>>>> are the same instruction because the CPU already knows it has no
>>>> internal CPU bus to broadcast the purge over so it in effect executes a
>>>> pdtlb,l regardless.
>>> I'd be happy to drop this part again.
>>> But is that true on a SMP system, where one has booted with maxcpus=1, too?
>> I would like to see what happens on panama.  Panama is a rp3410. Currently, it takes
>> approximately 4042 cycles to flush one page (4096 bytes).  This is way more than the number
>> of cycles that I see on my rp3440.  My c3750 takes 450 cycles per page with patch.  It could
>> be pdtlb,l and pdtlb are equivalent on c3750.
> Depends on what you flush.
> On c3750 we may get fooled because the kernel area could have been mapped via huge pages,
> while on rp34x0 the PA8900 CPU prevents huge pages for kernel.
> That may explain the performance difference between c3750 and rp3410, but not
> the difference to rp3440.
Regardless of whether the kernel area is mapped via huge pages, the loop
uses PAGE_SIZE, which is set to 4 KB.
I think there are 240 TLB entries on the above machines.  Does the size
of the mapping matter?

I could see huge pages slowing the test, as one would get a page fault
after every purge.  The Debian kernel is built with CONFIG_HUGETLB_PAGE.
>
>> Is there something wrong with SMP on panama?
>> Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
>> Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU
> Will check tomorrow.
>   
>> I know replacing "sync and normal store" with ordered store in spin lock release makes a
>> significant difference in the above timing.  Plan to send patch tonight.
> What exactly do you want me to test on panama?
pdtlb versus pdtlb,l.  It seems pdtlb is very slow on panama.
> Is the git head with my latest for-next tree [1] OK ?
It's a moving target ;-)
>
> Helge
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/log/?h=for-next
>

Dave
Helge Deller Oct. 17, 2018, 2:52 p.m. UTC | #8
On 16.10.2018 23:45, John David Anglin wrote:
> On 2018-10-16 4:51 PM, Helge Deller wrote:
>> On 16.10.2018 14:08, John David Anglin wrote:
>>> On 2018-10-16 1:34 AM, Helge Deller wrote:
>>>> On 15.10.2018 23:11, James Bottomley wrote:
>>>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>>>> This patch adds the necessary code to patch a running SMP kernel
>>>>>> at runtime to improve performance when running on a single CPU.
>>>>>>
>>>>>> The current implementation offers two patching variants:
>>>>>> - Unwanted assembler statements like locking functions are
>>>>>> overwritten
>>>>>>     with NOPs. When multiple instructions shall be skipped, one branch
>>>>>>     instruction is used instead of multiple nop instructions.
>>>>> This seems like a good idea because our spinlocks are particularly
>>>>> heavyweight.
>>>>>
>>>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>>>>>     pitlb,l which only flushes the CPU-local tlb entries instead of
>>>>>>     broadcasting the flush to other CPUs in the system and thus may
>>>>>>     improve performance.
>>>>> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
>>>>> are the same instruction because the CPU already knows it has no
>>>>> internal CPU bus to broadcast the purge over so it in effect executes a
>>>>> pdtlb,l regardless.
>>>> I'd be happy to drop this part again.
>>>> But is that true on a SMP system, where one has booted with maxcpus=1, too?
>>> I would like to see what happens on panama.  Panama is a rp3410. Currently, it takes
>>> approximately 4042 cycles to flush one page (4096 bytes).  This is way more than the number
>>> of cycles that I see on my rp3440.  My c3750 takes 450 cycles per page with patch.  It could
>>> be pdtlb,l and pdtlb are equivalent on c3750.
>> Depends on what you flush.
>> On c3750 we may get fooled because the kernel area could have been mapped via huge pages,
>> while on rp34x0 the PA8900 CPU prevents huge pages for kernel.
>> That may explain the performance difference between c3750 and rp3410, but not
>> the difference to rp3440.
> Regardless of whether the kernel area is mapped via huge pages, the loop uses PAGE_SIZE which is set to  4KB.
> I think there are 240 TLB entries on the above machines.  Does the size of the mapping matter?
> 
> I could see huge pages slowing the test as one would get a page fault after every purge.  Debian kernel
> is built with CONFIG_HUGETLB_PAGE.

Here are some numbers for a L3000 (rp5470) and panama (rp3410):

rp5470:
cpu family      : PA-RISC 2.0
cpu             : PA8700 (PCX-W2)
cpu MHz         : 875.000000
capabilities    : os64 iopdir_fdc nva_supported (0x05)
model           : 9000/800/L3000-8x
model name      : Marcato W+ (rp5470)?
I-cache         : 768 KB
D-cache         : 1536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB

4.19.0-rc8-64bit+ (plain Linux git head)
[    3.909737] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    3.921368] Whole cache flush 632875 cycles, flushing 19197952 bytes 9371547 cycles
[    3.921387] Cache flush threshold set to 1266 KiB
[    3.922510] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching):
[    4.143616] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    4.154970] Whole cache flush 629995 cycles, flushing 19173376 bytes 9103971 cycles
[    4.154992] Cache flush threshold set to 1295 KiB
[    4.155181] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching):
[    4.143193] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   28.327580] Whole cache flush 665022 cycles, flushing 19169280 bytes 9328514 cycles
[   28.327621] Cache flush threshold set to 1334 KiB
[   28.327828] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching), booted with "maxcpus=1":
[    4.117685] CPU(s): 1 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   38.509965] Cache flush threshold set to 828 KiB
[   38.511763] Whole TLB flush 11664 cycles, Range flush 19169280 bytes 1384989 cycles
[   38.624308] Calculated TLB flush threshold 160 KiB
[   38.624477] TLB flush threshold set to 160 KiB



panama:
cpu family      : PA-RISC 2.0
cpu             : PA8900 (Shortfin)
cpu MHz         : 800.002200
capabilities    : os64 iopdir_fdc needs_equivalent_aliasing (0x35)
model           : 9000/800/rp3410
model name      : Storm Peak DC- Slow Mako+
I-cache         : 65536 KB
D-cache         : 65536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB
bogomips        : 1594.36

Debian kernel: 4.18.0-2-parisc64-smp
[    1.144459] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.153842] Cache flush threshold set to 39768 KiB
[    1.177785] Whole TLB flush 6231 cycles, Range flush 18874368 bytes 18987500 cycles
[    1.178038] Calculated TLB flush threshold 8 KiB
[    1.178411] TLB flush threshold set to 512 KiB

4.19.0-rc8-64bit+ (vmlinuz-4.19-rc8) plain git head, no parisc patches
[    1.105625] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.115685] Whole cache flush 4271732 cycles, flushing 19197952 bytes 2028353 cycles
[    1.115702] Cache flush threshold set to 39483 KiB
[    1.136859] Whole TLB flush 6189 cycles, flushing 19197952 bytes 16779052 cycles
[    1.136869] TLB flush threshold set to 8 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching): (vmlinuz-4.19-rc7-noalternative)
[    1.233597] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.243121] Whole cache flush 4268326 cycles, flushing 19173376 bytes 2041619 cycles
[    1.243137] Cache flush threshold set to 39145 KiB
[    1.262504] Whole TLB flush 5430 cycles, Range flush 19173376 bytes 15324980 cycles
[    1.262758] Calculated TLB flush threshold 8 KiB
[    1.263126] TLB flush threshold set to 16 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching): (vmlinuz-4.19-rc7-for-next)
[    1.181601] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    2.662065] Whole cache flush 4287666 cycles, flushing 19169280 bytes 2040021 cycles
[    2.662087] Cache flush threshold set to 39345 KiB
[    2.663462] Whole TLB flush 7563 cycles, Range flush 19169280 bytes 940355 cycles
[    2.663718] Calculated TLB flush threshold 152 KiB
[    2.665174] TLB flush threshold set to 152 KiB


>>> Is there something wrong with SMP on panama?
>>> Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
>>> Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU
>> Will check tomorrow.

I think this is triggered by the 3 memory ranges which the firmware reports on panama:
[    0.000000] Memory Ranges:
[    0.000000]  0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
[    0.000000]  1) Start 0x0000000100000000 End 0x000000013fdfffff Size   1022 MB
[    0.000000]  2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
[    0.000000] Total Memory: 5118 MB

In arch/parisc/mm/init.c:296 we have:
        for (i = 0; i < npmem_ranges; i++) {
                node_set_state(i, N_NORMAL_MEMORY);
                node_set_online(i);
        }

Not sure if it's worth fixing...(or even needs fixing).
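
As a quick sanity check of the numbers above, the range sizes and the node
count can be reproduced in user space as below. This is only a sketch: the
struct and helper names (mem_range, range_mb(), total_mb(), nodes_online())
are made up, and nodes_online() merely mimics the effect of the quoted loop
with a bitmask instead of the kernel's node state API.

```c
#include <assert.h>
#include <stdint.h>

struct mem_range {
    uint64_t start;
    uint64_t end;   /* inclusive, as printed in the boot log */
};

/* Size of one range in MB. */
static uint64_t range_mb(struct mem_range r)
{
    assert(r.end >= r.start);
    return (r.end - r.start + 1) >> 20;
}

/* Sum of all range sizes in MB. */
static uint64_t total_mb(const struct mem_range *r, int n)
{
    uint64_t sum = 0;

    for (int i = 0; i < n; i++)
        sum += range_mb(r[i]);
    return sum;
}

/* Mimic the quoted loop: one node is marked online per memory range. */
static unsigned int nodes_online(int npmem_ranges)
{
    unsigned int mask = 0, count = 0;

    assert(npmem_ranges >= 0 && npmem_ranges < 32);
    for (int i = 0; i < npmem_ranges; i++)
        mask |= 1u << i;

    /* Count the bits that were set. */
    for (; mask; mask &= mask - 1)
        count++;
    return count;
}
```

With the three ranges from the log this gives 1024 + 1022 + 3072 = 5118 MB
and three online nodes, which matches the "Brought up 3 nodes, 1 CPU" line.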

>>> I know replacing "sync and normal store" with ordered store in spin lock release makes a
>>> significant difference in the above timing.  Plan to send patch tonight.
>> What exactly do you want me to test on panama?
> pdtlb versus pdtlb,l.  It seems pdtlb is very slow on panama.

Check numbers above.

Helge
John David Anglin Oct. 17, 2018, 4:47 p.m. UTC | #9
On 2018-10-17 10:52 AM, Helge Deller wrote:
> 4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching): (vmlinuz-4.19-rc7-noalternative)
> [    1.233597] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
> [    1.243121] Whole cache flush 4268326 cycles, flushing 19173376 bytes 2041619 cycles
> [    1.243137] Cache flush threshold set to 39145 KiB
> [    1.262504] Whole TLB flush 5430 cycles, Range flush 19173376 bytes 15324980 cycles
> [    1.262758] Calculated TLB flush threshold 8 KiB
> [    1.263126] TLB flush threshold set to 16 KiB
>
> 4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching): (vmlinuz-4.19-rc7-for-next)
> [    1.181601] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
> [    2.662065] Whole cache flush 4287666 cycles, flushing 19169280 bytes 2040021 cycles
> [    2.662087] Cache flush threshold set to 39345 KiB
> [    2.663462] Whole TLB flush 7563 cycles, Range flush 19169280 bytes 940355 cycles
> [    2.663718] Calculated TLB flush threshold 152 KiB
> [    2.665174] TLB flush threshold set to 152 KiB
It is clear the pdtlb,l patching makes a huge difference in the range
flushing on rp3410.  The number of cycles goes from 15324980 to 940355.
On c3750, the difference is about 100 cycles per page.  We lack numbers
on rp5470.

So, we want pdtlb,l patching.
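
For reference, the per-page cost implied by the two quoted "Range flush"
lines can be worked out as below. This is just arithmetic on the logged
numbers, assuming the flush loop steps by a 4096-byte PAGE_SIZE as discussed
earlier in the thread; the helper names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Whole pages covered by a range flush of `bytes` bytes. */
static uint64_t pages_in_range(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return bytes / page_size;
}

/* Average cycles per purged page, rounded down. */
static uint64_t cycles_per_page(uint64_t cycles, uint64_t bytes,
                                uint64_t page_size)
{
    uint64_t pages = pages_in_range(bytes, page_size);

    assert(pages > 0);
    return cycles / pages;
}
```

Without alternative patching, 15324980 cycles over 4681 pages is about 3273
cycles per page; with patching, 940355 cycles over 4680 pages is about 200
cycles per page.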

Dave

Patch

diff --git a/arch/parisc/include/asm/alternative.h b/arch/parisc/include/asm/alternative.h
new file mode 100644
index 000000000000..e4835bd376bf
--- /dev/null
+++ b/arch/parisc/include/asm/alternative.h
@@ -0,0 +1,41 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PARISC_ALTERNATIVE_H
+#define __ASM_PARISC_ALTERNATIVE_H
+
+#define INSN_PxTLB	0x02		/* modify pdtlb, pitlb */
+#define INSN_NOP	0x8000240	/* nop */
+
+
+#ifndef __ASSEMBLY__
+
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/stringify.h>
+
+struct alt_instr {
+	s32 orig_offset;	/* offset to original instructions */
+	u32 len;		/* number of original instructions (32-bit words) */
+	u32 replacement;	/* replacement instruction or code */
+};
+
+void set_kernel_text_rw(int enable_read_write);
+// int __init apply_alternatives_all(void);
+
+/* Alternative SMP implementation. */
+#define ALTERNATIVE(replacement)		"!0:"	\
+	".section .altinstructions, \"aw\"	!"	\
+	".word (0b-4-.), 1, " __stringify(replacement) "	!"	\
+	".previous"
+
+#else
+
+#define ALTERNATIVE(from, to, replacement)	\
+	.section .altinstructions, "aw"	!	\
+	.word (from - .), (to - from)/4	!	\
+	.word replacement		!	\
+	.previous
+
+#endif  /*  __ASSEMBLY__  */
+
+#endif /* __ASM_PARISC_ALTERNATIVE_H */
diff --git a/arch/parisc/include/asm/cache.h b/arch/parisc/include/asm/cache.h
index 150b7f30ea90..3e50a6e52fbd 100644
--- a/arch/parisc/include/asm/cache.h
+++ b/arch/parisc/include/asm/cache.h
@@ -43,6 +43,10 @@  void parisc_setup_cache_timing(void);
 
 #define pdtlb(addr)         asm volatile("pdtlb 0(%%sr1,%0)" : : "r" (addr));
 #define pitlb(addr)         asm volatile("pitlb 0(%%sr1,%0)" : : "r" (addr));
+#define pdtlb_alt(addr)	asm volatile("pdtlb 0(%%sr1,%0)" \
+				ALTERNATIVE(INSN_PxTLB) : : "r" (addr))
+#define pitlb_alt(addr)	asm volatile("pitlb 0(%%sr1,%0)" \
+				ALTERNATIVE(INSN_PxTLB) : : "r" (addr))
 #define pdtlb_kernel(addr)  asm volatile("pdtlb 0(%0)" : : "r" (addr));
 
 #endif /* ! __ASSEMBLY__ */
diff --git a/arch/parisc/include/asm/sections.h b/arch/parisc/include/asm/sections.h
index 5a40b51df80c..bb52aea0cb21 100644
--- a/arch/parisc/include/asm/sections.h
+++ b/arch/parisc/include/asm/sections.h
@@ -5,6 +5,8 @@ 
 /* nothing to see, move along */
 #include <asm-generic/sections.h>
 
+extern char __alt_instructions[], __alt_instructions_end[];
+
 #ifdef CONFIG_64BIT
 
 #define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 4209b74ce63c..576aff095ec8 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -28,6 +28,7 @@ 
 #include <asm/processor.h>
 #include <asm/sections.h>
 #include <asm/shmparam.h>
+#include <asm/alternative.h>
 
 int split_tlb __read_mostly;
 int dcache_stride __read_mostly;
@@ -483,7 +484,7 @@  int __flush_tlb_range(unsigned long sid, unsigned long start,
 		while (start < end) {
 			purge_tlb_start(flags);
 			mtsp(sid, 1);
-			pdtlb(start);
+			pdtlb_alt(start);
 			purge_tlb_end(flags);
 			start += PAGE_SIZE;
 		}
@@ -494,8 +495,8 @@  int __flush_tlb_range(unsigned long sid, unsigned long start,
 	while (start < end) {
 		purge_tlb_start(flags);
 		mtsp(sid, 1);
-		pdtlb(start);
-		pitlb(start);
+		pdtlb_alt(start);
+		pitlb_alt(start);
 		purge_tlb_end(flags);
 		start += PAGE_SIZE;
 	}
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 0d662f0e7b70..66a82f69776c 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -38,6 +38,7 @@ 
 #include <asm/ldcw.h>
 #include <asm/traps.h>
 #include <asm/thread_info.h>
+#include <asm/alternative.h>
 
 #include <linux/linkage.h>
 
@@ -464,7 +465,7 @@ 
 	/* Acquire pa_tlb_lock lock and check page is present. */
 	.macro		tlb_lock	spc,ptp,pte,tmp,tmp1,fault
 #ifdef CONFIG_SMP
-	cmpib,COND(=),n	0,\spc,2f
+98:	cmpib,COND(=),n	0,\spc,2f
 	load_pa_tlb_lock \tmp
 1:	LDCW		0(\tmp),\tmp1
 	cmpib,COND(=)	0,\tmp1,1b
@@ -473,6 +474,7 @@ 
 	bb,<,n		\pte,_PAGE_PRESENT_BIT,3f
 	b		\fault
 	stw,ma		\spc,0(\tmp)
+99:	ALTERNATIVE(98b, 99b, INSN_NOP)
 #endif
 2:	LDREG		0(\ptp),\pte
 	bb,>=,n		\pte,_PAGE_PRESENT_BIT,\fault
@@ -482,15 +484,17 @@ 
 	/* Release pa_tlb_lock lock without reloading lock address. */
 	.macro		tlb_unlock0	spc,tmp
 #ifdef CONFIG_SMP
-	or,COND(=)	%r0,\spc,%r0
+98:	or,COND(=)	%r0,\spc,%r0
 	stw,ma		\spc,0(\tmp)
+99:	ALTERNATIVE(98b, 99b, INSN_NOP)
 #endif
 	.endm
 
 	/* Release pa_tlb_lock lock. */
 	.macro		tlb_unlock1	spc,tmp
 #ifdef CONFIG_SMP
-	load_pa_tlb_lock \tmp
+98:	load_pa_tlb_lock \tmp
+99:	ALTERNATIVE(98b, 99b, INSN_NOP)
 	tlb_unlock0	\spc,\tmp
 #endif
 	.endm
diff --git a/arch/parisc/kernel/pacache.S b/arch/parisc/kernel/pacache.S
index f33bf2d306d6..11801b502352 100644
--- a/arch/parisc/kernel/pacache.S
+++ b/arch/parisc/kernel/pacache.S
@@ -37,6 +37,7 @@ 
 #include <asm/pgtable.h>
 #include <asm/cache.h>
 #include <asm/ldcw.h>
+#include <asm/alternative.h>
 #include <linux/linkage.h>
 #include <linux/init.h>
 
@@ -312,6 +313,7 @@  ENDPROC_CFI(flush_data_cache_local)
 
 	.macro	tlb_lock	la,flags,tmp
 #ifdef CONFIG_SMP
+98:
 #if __PA_LDCW_ALIGNMENT > 4
 	load32		pa_tlb_lock + __PA_LDCW_ALIGNMENT-1, \la
 	depi		0,31,__PA_LDCW_ALIGN_ORDER, \la
@@ -326,15 +328,17 @@  ENDPROC_CFI(flush_data_cache_local)
 	nop
 	b,n		2b
 3:
+99:	ALTERNATIVE(98b, 99b, INSN_NOP)
 #endif
 	.endm
 
 	.macro	tlb_unlock	la,flags,tmp
 #ifdef CONFIG_SMP
-	ldi		1,\tmp
+98:	ldi		1,\tmp
 	sync
 	stw		\tmp,0(\la)
 	mtsm		\flags
+99:	ALTERNATIVE(98b, 99b, INSN_NOP)
 #endif
 	.endm
 
@@ -596,9 +600,11 @@  ENTRY_CFI(copy_user_page_asm)
 	pdtlb,l		%r0(%r29)
 #else
 	tlb_lock	%r20,%r21,%r22
-	pdtlb		%r0(%r28)
-	pdtlb		%r0(%r29)
+0:	pdtlb		%r0(%r28)
+1:	pdtlb		%r0(%r29)
 	tlb_unlock	%r20,%r21,%r22
+	ALTERNATIVE(0b, 0b, INSN_PxTLB)
+	ALTERNATIVE(1b, 1b, INSN_PxTLB)
 #endif
 
 #ifdef CONFIG_64BIT
@@ -736,8 +742,9 @@  ENTRY_CFI(clear_user_page_asm)
 	pdtlb,l		%r0(%r28)
 #else
 	tlb_lock	%r20,%r21,%r22
-	pdtlb		%r0(%r28)
+0:	pdtlb		%r0(%r28)
 	tlb_unlock	%r20,%r21,%r22
+	ALTERNATIVE(0b, 0b, INSN_PxTLB)
 #endif
 
 #ifdef CONFIG_64BIT
@@ -813,8 +820,9 @@  ENTRY_CFI(flush_dcache_page_asm)
 	pdtlb,l		%r0(%r28)
 #else
 	tlb_lock	%r20,%r21,%r22
-	pdtlb		%r0(%r28)
+0:	pdtlb		%r0(%r28)
 	tlb_unlock	%r20,%r21,%r22
+	ALTERNATIVE(0b, 0b, INSN_PxTLB)
 #endif
 
 	ldil		L%dcache_stride, %r1
@@ -877,9 +885,11 @@  ENTRY_CFI(flush_icache_page_asm)
 	pitlb,l         %r0(%sr4,%r28)
 #else
 	tlb_lock        %r20,%r21,%r22
-	pdtlb		%r0(%r28)
-	pitlb           %r0(%sr4,%r28)
+0:	pdtlb		%r0(%r28)
+1:	pitlb           %r0(%sr4,%r28)
 	tlb_unlock      %r20,%r21,%r22
+	ALTERNATIVE(0b, 0b, INSN_PxTLB)
+	ALTERNATIVE(1b, 1b, INSN_PxTLB)
 #endif
 
 	ldil		L%icache_stride, %r1
diff --git a/arch/parisc/kernel/setup.c b/arch/parisc/kernel/setup.c
index 4e87c35c22b7..7fa151b5eb40 100644
--- a/arch/parisc/kernel/setup.c
+++ b/arch/parisc/kernel/setup.c
@@ -40,6 +40,7 @@ 
 #include <linux/sched/clock.h>
 #include <linux/start_kernel.h>
 
+#include <asm/alternative.h>
 #include <asm/processor.h>
 #include <asm/sections.h>
 #include <asm/pdc.h>
@@ -305,6 +306,55 @@  static int __init parisc_init_resources(void)
 	return 0;
 }
 
+static int __init apply_alternatives_all(void)
+{
+	struct alt_instr *entry;
+	int *from, len;
+	int ret = 0, replacement;
+
+	/* replace only when not running SMP CPUs */
+	if (num_online_cpus() > 1)
+		return 0;
+
+	pr_info("Patch SMP kernel to run on a single CPU.\n");
+
+	set_kernel_text_rw(1);
+
+	entry = (struct alt_instr *) &__alt_instructions;
+	while (entry < (struct alt_instr *) &__alt_instructions_end) {
+		from = (int *)((ulong)&entry->orig_offset + entry->orig_offset);
+		len = entry->len;
+
+		replacement = entry->replacement;
+
+		/* Want to replace pdtlb by a pdtlb,l instruction? */
+		if (replacement == INSN_PxTLB) {
+			replacement = *from;
+			if (boot_cpu_data.cpu_type >= pcxu) /* >= pa2.0 ? */
+				replacement |= (1 << 10); /* set el bit */
+		}
+
+		/*
+		 * Replace instruction with NOPs?
+		 * For long distance insert a branch instruction instead.
+		 */
+		if (replacement == INSN_NOP && len > 1)
+			replacement = 0xe8000002 + (len-2)*8; /* "b,n .+8" */
+
+		pr_debug("Replace %02d instructions @ 0x%px with 0x%08x\n",
+			len, from, replacement);
+
+		/* Replace instructions */
+		*from = replacement;
+
+		entry++;
+	}
+
+	set_kernel_text_rw(0);
+
+	return ret;
+}
+
 extern void gsc_init(void);
 extern void processor_init(void);
 extern void ccio_init(void);
@@ -346,6 +396,7 @@  static int __init parisc_init(void)
 			boot_cpu_data.cpu_hz / 1000000,
 			boot_cpu_data.cpu_hz % 1000000	);
 
+	apply_alternatives_all();
 	parisc_setup_cache_timing();
 
 	/* These are in a non-obvious order, will fix when we have an iotree */
diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S
index da2e31190efa..ef721fc3671b 100644
--- a/arch/parisc/kernel/vmlinux.lds.S
+++ b/arch/parisc/kernel/vmlinux.lds.S
@@ -25,7 +25,7 @@ 
 #include <asm/page.h>
 #include <asm/asm-offsets.h>
 #include <asm/thread_info.h>
-	
+
 /* ld script to make hppa Linux kernel */
 #ifndef CONFIG_64BIT
 OUTPUT_FORMAT("elf32-hppa-linux")
@@ -61,6 +61,12 @@  SECTIONS
 		EXIT_DATA
 	}
 	PERCPU_SECTION(8)
+	. = ALIGN(4);
+	.altinstructions : {
+		__alt_instructions = .;
+		*(.altinstructions)
+		__alt_instructions_end = .;
+	}
 	. = ALIGN(HUGEPAGE_SIZE);
 	__init_end = .;
 	/* freed after init ends here */
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 74842d28a7a1..ff80ffdd09c7 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -515,6 +512,21 @@  static void __init map_pages(unsigned long start_vaddr,
 	}
 }
 
+void __init set_kernel_text_rw(int enable_read_write)
+{
+	unsigned long start = (unsigned long)_stext;
+	unsigned long end   = (unsigned long)_etext;
+
+	map_pages(start, __pa(start), end-start,
+		PAGE_KERNEL_RWX, enable_read_write ? 1:0);
+
+	/* force the kernel to see the new TLB entries */
+	__flush_tlb_range(0, start, end);
+
+	/* dump old cached instructions */
+	flush_icache_range(start, end);
+}
+
 void __ref free_initmem(void)
 {
 	unsigned long init_begin = (unsigned long)__init_begin;