diff mbox series

[V3] parisc: Rewrite cache flush code for PA8800/PA8900

Message ID YoJqZ2rUA25360Ld@mx3210.localdomain (mailing list archive)
State Superseded, archived
Headers show
Series [V3] parisc: Rewrite cache flush code for PA8800/PA8900 | expand

Commit Message

John David Anglin May 16, 2022, 3:14 p.m. UTC
Originally, I was convinced that we needed to use tmpalias flushes
everwhere, for both user and kernel flushes. However, when I modified
flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
would crash quite early when booting.

The PDC returns alias values of 0 for the icache and dcache. This
indicates that either the alias boundary is greater than 16MB or
equivalent aliasing doesn't work. I modified the tmpalias code to
make it easy to try alternate boundaries. I tried boundaries up to
128MB but still kernel tmpalias flushes didn't work on c8000.

This led me to conclude that tmpalias flushes don't work on PA8800
and PA8900 machines, and that we needed to flush directly using the
virtual address of user and kernel pages. This is likely the major
cause of instability on the c8000 and rp34xx machines.

Flushing user pages requires doing a temporary context switch as we
have to flush pages that don't belong to the current context. Further,
we have to deal with pages that aren't present. If a page isn't
present, the flush instructions fault on every line.

Other code has been rearranged and simplified based on testing. For
example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
and flush_cache_dup_mm differ in that flush_cache_mm calls
purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
In some implementations, pdc is more efficient than fdc. Based on
my testing, I don't believe there's any performance benefit on the
c8000.

V2:
1) Add flush_cache_page_check_pte routine.
2) Use it in copy_to_user_page and copy_from_user_page.
3) flush_anon_page moved to cache.c and updated.
4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
   regarding alias boundary for PA8800/PA8900 processors.
5) Removed struct mm_struct * argument from flush_cache_pages.
6) Fixed thinko in flush_cache_range. It increased the number of pages
   flushed and slowed performance.
7) Removed sync changes from pacache.S.

V3:
1) copy_to_user_page and copy_from_user_page moved to cache.c to
   improve inlining.
2) Replaced copy_user_page with copy_user_highpage.
3) Fixed cache threshold calculation on 32-bit kernels.
4) Don't warn on inequivalent private mappings in flush_dcache_page.
5) Return early from mm_total_size if size exceeds
   parisc_cache_flush_threshold.
6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
   happens occassionally handling flushes for COW faults.
7) Remove flush_cache_dup_mm.
8) Flush entire cache in flush_cache_mm and flush_cache_range on
   processors with aliasing caches. Only flush small cache ranges
   on machines with PA8800/PA8900 processors.
9) Tested on rp3440, c8000 and c3750.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
---

Comments

Rolf Eike Beer May 16, 2022, 9:28 p.m. UTC | #1
Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
> Originally, I was convinced that we needed to use tmpalias flushes
> everwhere, for both user and kernel flushes. However, when I modified
> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
> would crash quite early when booting.
> 
> The PDC returns alias values of 0 for the icache and dcache. This
> indicates that either the alias boundary is greater than 16MB or
> equivalent aliasing doesn't work. I modified the tmpalias code to
> make it easy to try alternate boundaries. I tried boundaries up to
> 128MB but still kernel tmpalias flushes didn't work on c8000.
> 
> This led me to conclude that tmpalias flushes don't work on PA8800
> and PA8900 machines, and that we needed to flush directly using the
> virtual address of user and kernel pages. This is likely the major
> cause of instability on the c8000 and rp34xx machines.
> 
> Flushing user pages requires doing a temporary context switch as we
> have to flush pages that don't belong to the current context. Further,
> we have to deal with pages that aren't present. If a page isn't
> present, the flush instructions fault on every line.
> 
> Other code has been rearranged and simplified based on testing. For
> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
> and flush_cache_dup_mm differ in that flush_cache_mm calls
> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
> In some implementations, pdc is more efficient than fdc. Based on
> my testing, I don't believe there's any performance benefit on the
> c8000.
> 
> V2:
> 1) Add flush_cache_page_check_pte routine.
> 2) Use it in copy_to_user_page and copy_from_user_page.
> 3) flush_anon_page moved to cache.c and updated.
> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>    regarding alias boundary for PA8800/PA8900 processors.
> 5) Removed struct mm_struct * argument from flush_cache_pages.
> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>    flushed and slowed performance.
> 7) Removed sync changes from pacache.S.
> 
> V3:
> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>    improve inlining.
> 2) Replaced copy_user_page with copy_user_highpage.
> 3) Fixed cache threshold calculation on 32-bit kernels.
> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
> 5) Return early from mm_total_size if size exceeds
>    parisc_cache_flush_threshold.
> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>    happens occassionally handling flushes for COW faults.
> 7) Remove flush_cache_dup_mm.
> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>    processors with aliasing caches. Only flush small cache ranges
>    on machines with PA8800/PA8900 processors.
> 9) Tested on rp3440, c8000 and c3750.

Given how long these changelogs are, and how fragile the whole caching is I 
think it is a good idea to split this patch into smaller ones, to improve 
readability and being able to bisect it.
Helge Deller May 16, 2022, 9:49 p.m. UTC | #2
On 5/16/22 23:28, Rolf Eike Beer wrote:
> Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
>> Originally, I was convinced that we needed to use tmpalias flushes
>> everwhere, for both user and kernel flushes. However, when I modified
>> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
>> would crash quite early when booting.
>>
>> The PDC returns alias values of 0 for the icache and dcache. This
>> indicates that either the alias boundary is greater than 16MB or
>> equivalent aliasing doesn't work. I modified the tmpalias code to
>> make it easy to try alternate boundaries. I tried boundaries up to
>> 128MB but still kernel tmpalias flushes didn't work on c8000.
>>
>> This led me to conclude that tmpalias flushes don't work on PA8800
>> and PA8900 machines, and that we needed to flush directly using the
>> virtual address of user and kernel pages. This is likely the major
>> cause of instability on the c8000 and rp34xx machines.
>>
>> Flushing user pages requires doing a temporary context switch as we
>> have to flush pages that don't belong to the current context. Further,
>> we have to deal with pages that aren't present. If a page isn't
>> present, the flush instructions fault on every line.
>>
>> Other code has been rearranged and simplified based on testing. For
>> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
>> and flush_cache_dup_mm differ in that flush_cache_mm calls
>> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
>> In some implementations, pdc is more efficient than fdc. Based on
>> my testing, I don't believe there's any performance benefit on the
>> c8000.
>>
>> V2:
>> 1) Add flush_cache_page_check_pte routine.
>> 2) Use it in copy_to_user_page and copy_from_user_page.
>> 3) flush_anon_page moved to cache.c and updated.
>> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>>    regarding alias boundary for PA8800/PA8900 processors.
>> 5) Removed struct mm_struct * argument from flush_cache_pages.
>> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>>    flushed and slowed performance.
>> 7) Removed sync changes from pacache.S.
>>
>> V3:
>> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>>    improve inlining.
>> 2) Replaced copy_user_page with copy_user_highpage.
>> 3) Fixed cache threshold calculation on 32-bit kernels.
>> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
>> 5) Return early from mm_total_size if size exceeds
>>    parisc_cache_flush_threshold.
>> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>>    happens occassionally handling flushes for COW faults.
>> 7) Remove flush_cache_dup_mm.
>> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>>    processors with aliasing caches. Only flush small cache ranges
>>    on machines with PA8800/PA8900 processors.
>> 9) Tested on rp3440, c8000 and c3750.
>
> Given how long these changelogs are, and how fragile the whole caching is I
> think it is a good idea to split this patch into smaller ones, to improve
> readability and being able to bisect it.

FWIW, I've done some cleanups to this patch and committed it to my for-next tree.
In case it's split up, please use the revised version.

Helge
Sam James May 16, 2022, 10:09 p.m. UTC | #3
> On 16 May 2022, at 22:49, Helge Deller <deller@gmx.de> wrote:
> 
> [snip]
Hi Helge,

> FWIW, I've done some cleanups to this patch and committed it to my for-next tree.
> In case it's split up, please use the revised version.
> 

Should I be testing with for-next (which contains this patch) or for-next-next (which has some smaller bits)?

best,
sam

> Helge
Helge Deller May 16, 2022, 10:24 p.m. UTC | #4
On 5/17/22 00:09, Sam James wrote:
>> On 16 May 2022, at 22:49, Helge Deller <deller@gmx.de> wrote:
>>
>> [snip]
> Hi Helge,
>
>> FWIW, I've done some cleanups to this patch and committed it to my for-next tree.
>> In case it's split up, please use the revised version.
>>
>
> Should I be testing with for-next (which contains this patch) or for-next-next (which has some smaller bits)?

for-next is for v5.18.

for-next-next is planed to for v5.19

so, please use for-next, since we want to get 5.18 finally fixed.

Thanks!
Helge
John David Anglin May 16, 2022, 10:24 p.m. UTC | #5
On 2022-05-16 6:09 p.m., Sam James wrote:
>> FWIW, I've done some cleanups to this patch and committed it to my for-next tree.
>> In case it's split up, please use the revised version.
>>
> Should I be testing with for-next (which contains this patch) or for-next-next (which has some smaller bits)?
The change in for-next-next shouldn't affect functionality.  The patch for-next is the one to test.

Dave
Sam James May 16, 2022, 10:54 p.m. UTC | #6
> On 16 May 2022, at 23:24, Helge Deller <deller@gmx.de> wrote:
> 
> On 5/17/22 00:09, Sam James wrote:
>>> On 16 May 2022, at 22:49, Helge Deller <deller@gmx.de> wrote:
>>> 
>>> [snip]
>> Hi Helge,
>> 
>>> FWIW, I've done some cleanups to this patch and committed it to my for-next tree.
>>> In case it's split up, please use the revised version.
>>> 
>> 
>> Should I be testing with for-next (which contains this patch) or for-next-next (which has some smaller bits)?
> 
> for-next is for v5.18.
> 
> for-next-next is planed to for v5.19
> 
> so, please use for-next, since we want to get 5.18 finally fixed.
> 

Got it, thanks to both you and Dave!

I'll be testing it on RP3440 (PA8800) and C8000 (PA8900). If you need it, I can test on C3600 (PA8600) too.

Best,
sam

> Thanks!
> Helge
John David Anglin May 16, 2022, 11:41 p.m. UTC | #7
On 2022-05-16 6:54 p.m., Sam James wrote:
> I'll be testing it on RP3440 (PA8800) and C8000 (PA8900). If you need it, I can test on C3600 (PA8600) too.
I would test all three.  The latest version affects them all.

Watch for random segmentation faults, malloc corruption, etc.

Thanks,
Dave
Rolf Eike Beer May 17, 2022, 1:06 p.m. UTC | #8
Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
> Originally, I was convinced that we needed to use tmpalias flushes
> everwhere, for both user and kernel flushes. However, when I modified
> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
> would crash quite early when booting.
> 
> The PDC returns alias values of 0 for the icache and dcache. This
> indicates that either the alias boundary is greater than 16MB or
> equivalent aliasing doesn't work. I modified the tmpalias code to
> make it easy to try alternate boundaries. I tried boundaries up to
> 128MB but still kernel tmpalias flushes didn't work on c8000.
> 
> This led me to conclude that tmpalias flushes don't work on PA8800
> and PA8900 machines, and that we needed to flush directly using the
> virtual address of user and kernel pages. This is likely the major
> cause of instability on the c8000 and rp34xx machines.
> 
> Flushing user pages requires doing a temporary context switch as we
> have to flush pages that don't belong to the current context. Further,
> we have to deal with pages that aren't present. If a page isn't
> present, the flush instructions fault on every line.
> 
> Other code has been rearranged and simplified based on testing. For
> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
> and flush_cache_dup_mm differ in that flush_cache_mm calls
> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
> In some implementations, pdc is more efficient than fdc. Based on
> my testing, I don't believe there's any performance benefit on the
> c8000.

> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index f114e102aaf2..ca49765784fc 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -22,6 +22,8 @@
> 
>  #include <asm/traps.h>
> 
> +/* #define DEBUG_NATLB 1 */
> +
>  /* Various important other fields */
>  #define bit22set(x)		(x & 0x00000200)
>  #define bits23_25set(x)		(x & 0x000001c0)
> @@ -450,10 +452,12 @@ handle_nadtlb_fault(struct pt_regs *regs)
>  		fallthrough;
>  	case 0x380:
>  		/* PDC and FIC instructions */
> +#ifdef DEBUG_NATLB
>  		if (printk_ratelimit()) {
> -			pr_warn("BUG: nullifying cache flush/purge 
instruction\n");
> +			pr_warn("WARNING: nullifying cache flush/
purge instruction\n");
>  			show_regs(regs);
>  		}
> +#endif
>  		if (insn & 0x20) {
>  			/* Base modification */
>  			breg = (insn >> 21) & 0x1f;

This surely deserves it's own commit as it has nothing to do with the actual 
change. I wonder if it is actually intended to go upstream or if this was just 
a local debug hack?

Eike
Rolf Eike Beer May 17, 2022, 1:19 p.m. UTC | #9
Am Montag, 16. Mai 2022, 23:49:10 CEST schrieb Helge Deller:
> On 5/16/22 23:28, Rolf Eike Beer wrote:
> > Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
> >> Originally, I was convinced that we needed to use tmpalias flushes
> >> everwhere, for both user and kernel flushes. However, when I modified
> >> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
> >> would crash quite early when booting.
> >> 
> >> The PDC returns alias values of 0 for the icache and dcache. This
> >> indicates that either the alias boundary is greater than 16MB or
> >> equivalent aliasing doesn't work. I modified the tmpalias code to
> >> make it easy to try alternate boundaries. I tried boundaries up to
> >> 128MB but still kernel tmpalias flushes didn't work on c8000.
> >> 
> >> This led me to conclude that tmpalias flushes don't work on PA8800
> >> and PA8900 machines, and that we needed to flush directly using the
> >> virtual address of user and kernel pages. This is likely the major
> >> cause of instability on the c8000 and rp34xx machines.
> >> 
> >> Flushing user pages requires doing a temporary context switch as we
> >> have to flush pages that don't belong to the current context. Further,
> >> we have to deal with pages that aren't present. If a page isn't
> >> present, the flush instructions fault on every line.
> >> 
> >> Other code has been rearranged and simplified based on testing. For
> >> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
> >> and flush_cache_dup_mm differ in that flush_cache_mm calls
> >> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
> >> In some implementations, pdc is more efficient than fdc. Based on
> >> my testing, I don't believe there's any performance benefit on the
> >> c8000.
> >> 
> >> V2:
> >> 1) Add flush_cache_page_check_pte routine.
> >> 2) Use it in copy_to_user_page and copy_from_user_page.
> >> 3) flush_anon_page moved to cache.c and updated.
> >> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
> >> 
> >>    regarding alias boundary for PA8800/PA8900 processors.
> >> 
> >> 5) Removed struct mm_struct * argument from flush_cache_pages.
> >> 6) Fixed thinko in flush_cache_range. It increased the number of pages
> >> 
> >>    flushed and slowed performance.
> >> 
> >> 7) Removed sync changes from pacache.S.
> >> 
> >> V3:
> >> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
> >> 
> >>    improve inlining.
> >> 
> >> 2) Replaced copy_user_page with copy_user_highpage.
> >> 3) Fixed cache threshold calculation on 32-bit kernels.
> >> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
> >> 5) Return early from mm_total_size if size exceeds
> >> 
> >>    parisc_cache_flush_threshold.
> >> 
> >> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
> >> 
> >>    happens occassionally handling flushes for COW faults.
> >> 
> >> 7) Remove flush_cache_dup_mm.
> >> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
> >> 
> >>    processors with aliasing caches. Only flush small cache ranges
> >>    on machines with PA8800/PA8900 processors.
> >> 
> >> 9) Tested on rp3440, c8000 and c3750.
> > 
> > Given how long these changelogs are, and how fragile the whole caching is
> > I
> > think it is a good idea to split this patch into smaller ones, to improve
> > readability and being able to bisect it.
> 
> FWIW, I've done some cleanups to this patch and committed it to my for-next
> tree. In case it's split up, please use the revised version.

Why did you modify get_ptep()? Until now it was just moved around in the file, 
and IMHO it becomes less readable because all these needless variables are 
batched up at the start of the function now. The only point I would see in 
moving them all to the front is if there would be no nesting anymore, and 
every condition was inverted:

if (pgd_none(*pgd))
	return NULL;

[…]

return pte_offset_map(pmd, addr);

Bit this would as well be a different commit.
Helge Deller May 17, 2022, 1:24 p.m. UTC | #10
On 5/17/22 15:19, Rolf Eike Beer wrote:
> Am Montag, 16. Mai 2022, 23:49:10 CEST schrieb Helge Deller:
>> On 5/16/22 23:28, Rolf Eike Beer wrote:
>>> Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
>>>> Originally, I was convinced that we needed to use tmpalias flushes
>>>> everwhere, for both user and kernel flushes. However, when I modified
>>>> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
>>>> would crash quite early when booting.
>>>>
>>>> The PDC returns alias values of 0 for the icache and dcache. This
>>>> indicates that either the alias boundary is greater than 16MB or
>>>> equivalent aliasing doesn't work. I modified the tmpalias code to
>>>> make it easy to try alternate boundaries. I tried boundaries up to
>>>> 128MB but still kernel tmpalias flushes didn't work on c8000.
>>>>
>>>> This led me to conclude that tmpalias flushes don't work on PA8800
>>>> and PA8900 machines, and that we needed to flush directly using the
>>>> virtual address of user and kernel pages. This is likely the major
>>>> cause of instability on the c8000 and rp34xx machines.
>>>>
>>>> Flushing user pages requires doing a temporary context switch as we
>>>> have to flush pages that don't belong to the current context. Further,
>>>> we have to deal with pages that aren't present. If a page isn't
>>>> present, the flush instructions fault on every line.
>>>>
>>>> Other code has been rearranged and simplified based on testing. For
>>>> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
>>>> and flush_cache_dup_mm differ in that flush_cache_mm calls
>>>> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
>>>> In some implementations, pdc is more efficient than fdc. Based on
>>>> my testing, I don't believe there's any performance benefit on the
>>>> c8000.
>>>>
>>>> V2:
>>>> 1) Add flush_cache_page_check_pte routine.
>>>> 2) Use it in copy_to_user_page and copy_from_user_page.
>>>> 3) flush_anon_page moved to cache.c and updated.
>>>> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>>>>
>>>>    regarding alias boundary for PA8800/PA8900 processors.
>>>>
>>>> 5) Removed struct mm_struct * argument from flush_cache_pages.
>>>> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>>>>
>>>>    flushed and slowed performance.
>>>>
>>>> 7) Removed sync changes from pacache.S.
>>>>
>>>> V3:
>>>> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>>>>
>>>>    improve inlining.
>>>>
>>>> 2) Replaced copy_user_page with copy_user_highpage.
>>>> 3) Fixed cache threshold calculation on 32-bit kernels.
>>>> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
>>>> 5) Return early from mm_total_size if size exceeds
>>>>
>>>>    parisc_cache_flush_threshold.
>>>>
>>>> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>>>>
>>>>    happens occassionally handling flushes for COW faults.
>>>>
>>>> 7) Remove flush_cache_dup_mm.
>>>> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>>>>
>>>>    processors with aliasing caches. Only flush small cache ranges
>>>>    on machines with PA8800/PA8900 processors.
>>>>
>>>> 9) Tested on rp3440, c8000 and c3750.
>>>
>>> Given how long these changelogs are, and how fragile the whole caching is
>>> I
>>> think it is a good idea to split this patch into smaller ones, to improve
>>> readability and being able to bisect it.
>>
>> FWIW, I've done some cleanups to this patch and committed it to my for-next
>> tree. In case it's split up, please use the revised version.
>
> Why did you modify get_ptep()? Until now it was just moved around in the file,
> and IMHO it becomes less readable because all these needless variables are
> batched up at the start of the function now. The only point I would see in
> moving them all to the front is if there would be no nesting anymore, and
> every condition was inverted:

Dave's original patch did not moved the variables to the beginning.
That change was me - just because checkpatch complained otherwise.

I agree that it's less readable.

Helge
John David Anglin May 17, 2022, 2:05 p.m. UTC | #11
On 2022-05-17 9:06 a.m., Rolf Eike Beer wrote:
>> @@ -450,10 +452,12 @@ handle_nadtlb_fault(struct pt_regs *regs)
>>   		fallthrough;
>>   	case 0x380:
>>   		/* PDC and FIC instructions */
>> +#ifdef DEBUG_NATLB
>>   		if (printk_ratelimit()) {
>> -			pr_warn("BUG: nullifying cache flush/purge
> instruction\n");
>> +			pr_warn("WARNING: nullifying cache flush/
> purge instruction\n");
>>   			show_regs(regs);
>>   		}
>> +#endif
>>   		if (insn & 0x20) {
>>   			/* Base modification */
>>   			breg = (insn >> 21) & 0x1f;
> This surely deserves it's own commit as it has nothing to do with the actual
> change. I wonder if it is actually intended to go upstream or if this was just
> a local debug hack?
I changed "BUG" to "WARNING" and disabled the message because it triggers occasionally in spite of
the check in flush_cache_page_if_present.

The pte value extracted for the "from" page in copy_user_highpage is racy and occasionally the pte is
cleared before the flush is complete.  I assume that the page is simultaneously flushed by flush_cache_mm
before the pte is cleared as nullifying the fdc doesn't seem to cause problems.

I investigated various locking scenarios but I wasn't able to find a way to sequence the flushes.  This
code is called for every COW break and locks impact performance.

This is related to this patch because we need the pte on PA8800/PA8900 to flush using the vma context.

I have also seen this from copy_to_user_page and copy_from_user_page.

The messages appear infrequently when enabled.

Dave
John David Anglin May 17, 2022, 2:26 p.m. UTC | #12
On 2022-05-17 9:24 a.m., Helge Deller wrote:
> On 5/17/22 15:19, Rolf Eike Beer wrote:
>> Am Montag, 16. Mai 2022, 23:49:10 CEST schrieb Helge Deller:
>>> On 5/16/22 23:28, Rolf Eike Beer wrote:
>>>> Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
>>>>> Originally, I was convinced that we needed to use tmpalias flushes
>>>>> everwhere, for both user and kernel flushes. However, when I modified
>>>>> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
>>>>> would crash quite early when booting.
>>>>>
>>>>> The PDC returns alias values of 0 for the icache and dcache. This
>>>>> indicates that either the alias boundary is greater than 16MB or
>>>>> equivalent aliasing doesn't work. I modified the tmpalias code to
>>>>> make it easy to try alternate boundaries. I tried boundaries up to
>>>>> 128MB but still kernel tmpalias flushes didn't work on c8000.
>>>>>
>>>>> This led me to conclude that tmpalias flushes don't work on PA8800
>>>>> and PA8900 machines, and that we needed to flush directly using the
>>>>> virtual address of user and kernel pages. This is likely the major
>>>>> cause of instability on the c8000 and rp34xx machines.
>>>>>
>>>>> Flushing user pages requires doing a temporary context switch as we
>>>>> have to flush pages that don't belong to the current context. Further,
>>>>> we have to deal with pages that aren't present. If a page isn't
>>>>> present, the flush instructions fault on every line.
>>>>>
>>>>> Other code has been rearranged and simplified based on testing. For
>>>>> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
>>>>> and flush_cache_dup_mm differ in that flush_cache_mm calls
>>>>> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
>>>>> In some implementations, pdc is more efficient than fdc. Based on
>>>>> my testing, I don't believe there's any performance benefit on the
>>>>> c8000.
>>>>>
>>>>> V2:
>>>>> 1) Add flush_cache_page_check_pte routine.
>>>>> 2) Use it in copy_to_user_page and copy_from_user_page.
>>>>> 3) flush_anon_page moved to cache.c and updated.
>>>>> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>>>>>
>>>>>     regarding alias boundary for PA8800/PA8900 processors.
>>>>>
>>>>> 5) Removed struct mm_struct * argument from flush_cache_pages.
>>>>> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>>>>>
>>>>>     flushed and slowed performance.
>>>>>
>>>>> 7) Removed sync changes from pacache.S.
>>>>>
>>>>> V3:
>>>>> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>>>>>
>>>>>     improve inlining.
>>>>>
>>>>> 2) Replaced copy_user_page with copy_user_highpage.
>>>>> 3) Fixed cache threshold calculation on 32-bit kernels.
>>>>> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
>>>>> 5) Return early from mm_total_size if size exceeds
>>>>>
>>>>>     parisc_cache_flush_threshold.
>>>>>
>>>>> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>>>>>
>>>>>     happens occassionally handling flushes for COW faults.
>>>>>
>>>>> 7) Remove flush_cache_dup_mm.
>>>>> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>>>>>
>>>>>     processors with aliasing caches. Only flush small cache ranges
>>>>>     on machines with PA8800/PA8900 processors.
>>>>>
>>>>> 9) Tested on rp3440, c8000 and c3750.
>>>> Given how long these changelogs are, and how fragile the whole caching is
>>>> I
>>>> think it is a good idea to split this patch into smaller ones, to improve
>>>> readability and being able to bisect it.
>>> FWIW, I've done some cleanups to this patch and committed it to my for-next
>>> tree. In case it's split up, please use the revised version.
>> Why did you modify get_ptep()? Until now it was just moved around in the file,
>> and IMHO it becomes less readable because all these needless variables are
>> batched up at the start of the function now. The only point I would see in
>> moving them all to the front is if there would be no nesting anymore, and
>> every condition was inverted:
> Dave's original patch did not moved the variables to the beginning.
> That change was me - just because checkpatch complained otherwise.
>
> I agree that it's less readable.
The get_ptep code was originally based on the vmalloc_to_page code in vmalloc.c.
vmalloc_to_page code.  This code has now changed, so I think get_ptep needs updating.
See get_old_pud in mremap.c.  It looks good to me.

Dave
Helge Deller May 17, 2022, 6:11 p.m. UTC | #13
On 5/17/22 16:26, John David Anglin wrote:
> On 2022-05-17 9:24 a.m., Helge Deller wrote:
>> On 5/17/22 15:19, Rolf Eike Beer wrote:
>>> Am Montag, 16. Mai 2022, 23:49:10 CEST schrieb Helge Deller:
>>>> On 5/16/22 23:28, Rolf Eike Beer wrote:
>>>>> Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
>>>>>> Originally, I was convinced that we needed to use tmpalias flushes
>>>>>> everwhere, for both user and kernel flushes. However, when I modified
>>>>>> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
>>>>>> would crash quite early when booting.
>>>>>>
>>>>>> The PDC returns alias values of 0 for the icache and dcache. This
>>>>>> indicates that either the alias boundary is greater than 16MB or
>>>>>> equivalent aliasing doesn't work. I modified the tmpalias code to
>>>>>> make it easy to try alternate boundaries. I tried boundaries up to
>>>>>> 128MB but still kernel tmpalias flushes didn't work on c8000.
>>>>>>
>>>>>> This led me to conclude that tmpalias flushes don't work on PA8800
>>>>>> and PA8900 machines, and that we needed to flush directly using the
>>>>>> virtual address of user and kernel pages. This is likely the major
>>>>>> cause of instability on the c8000 and rp34xx machines.
>>>>>>
>>>>>> Flushing user pages requires doing a temporary context switch as we
>>>>>> have to flush pages that don't belong to the current context. Further,
>>>>>> we have to deal with pages that aren't present. If a page isn't
>>>>>> present, the flush instructions fault on every line.
>>>>>>
>>>>>> Other code has been rearranged and simplified based on testing. For
>>>>>> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
>>>>>> and flush_cache_dup_mm differ in that flush_cache_mm calls
>>>>>> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
>>>>>> In some implementations, pdc is more efficient than fdc. Based on
>>>>>> my testing, I don't believe there's any performance benefit on the
>>>>>> c8000.
>>>>>>
>>>>>> V2:
>>>>>> 1) Add flush_cache_page_check_pte routine.
>>>>>> 2) Use it in copy_to_user_page and copy_from_user_page.
>>>>>> 3) flush_anon_page moved to cache.c and updated.
>>>>>> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>>>>>>
>>>>>>     regarding alias boundary for PA8800/PA8900 processors.
>>>>>>
>>>>>> 5) Removed struct mm_struct * argument from flush_cache_pages.
>>>>>> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>>>>>>
>>>>>>     flushed and slowed performance.
>>>>>>
>>>>>> 7) Removed sync changes from pacache.S.
>>>>>>
>>>>>> V3:
>>>>>> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>>>>>>
>>>>>>     improve inlining.
>>>>>>
>>>>>> 2) Replaced copy_user_page with copy_user_highpage.
>>>>>> 3) Fixed cache threshold calculation on 32-bit kernels.
>>>>>> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
>>>>>> 5) Return early from mm_total_size if size exceeds
>>>>>>
>>>>>>     parisc_cache_flush_threshold.
>>>>>>
>>>>>> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>>>>>>
>>>>>>     happens occassionally handling flushes for COW faults.
>>>>>>
>>>>>> 7) Remove flush_cache_dup_mm.
>>>>>> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>>>>>>
>>>>>>     processors with aliasing caches. Only flush small cache ranges
>>>>>>     on machines with PA8800/PA8900 processors.
>>>>>>
>>>>>> 9) Tested on rp3440, c8000 and c3750.
>>>>> Given how long these changelogs are, and how fragile the whole caching is
>>>>> I
>>>>> think it is a good idea to split this patch into smaller ones, to improve
>>>>> readability and being able to bisect it.
>>>> FWIW, I've done some cleanups to this patch and committed it to my for-next
>>>> tree. In case it's split up, please use the revised version.
>>> Why did you modify get_ptep()? Until now it was just moved around in the file,
>>> and IMHO it becomes less readable because all these needless variables are
>>> batched up at the start of the function now. The only point I would see in
>>> moving them all to the front is if there would be no nesting anymore, and
>>> every condition was inverted:
>> Dave's original patch did not moved the variables to the beginning.
>> That change was me - just because checkpatch complained otherwise.
>>
>> I agree that it's less readable.
> The get_ptep code was originally based on the vmalloc_to_page code in vmalloc.c.
> vmalloc_to_page code.  This code has now changed, so I think get_ptep needs updating.
> See get_old_pud in mremap.c.  It looks good to me.

So, what's the consense of this discussion now?

I can easily split out the pr_warn("WARNING").
Moving the get_ptep() back to the original place seems ok, and I'll keep
the strange indenting which checkpatch want.
Is that ok?

Helge
Rolf Eike Beer May 17, 2022, 6:28 p.m. UTC | #14
Am Dienstag, 17. Mai 2022, 20:11:38 CEST schrieb Helge Deller:

> I can easily split out the pr_warn("WARNING").

Would make sense IMHO.

> Moving the get_ptep() back to the original place seems ok, and I'll keep
> the strange indenting which checkpatch want.

If its back at the original place then there is no need to change as 
checkpatch will not complain on unmodified lines. If it needs to be moved and 
changed then I would say do it in it's own patch as well.

Eike
John David Anglin May 17, 2022, 6:44 p.m. UTC | #15
On 2022-05-17 2:28 p.m., Rolf Eike Beer wrote:
>> Moving the get_ptep() back to the original place seems ok, and I'll keep
>> the strange indenting which checkpatch want.
> If its back at the original place then there is no need to change as
> checkpatch will not complain on unmodified lines. If it needs to be moved and
> changed then I would say do it in it's own patch as well.
I moved it forward because there are new references.  Otherwise, a static declaration
is needed.

I wanted the "copy" routines together.

Dave
John David Anglin May 17, 2022, 6:51 p.m. UTC | #16
On 2022-05-17 2:11 p.m., Helge Deller wrote:
> On 5/17/22 16:26, John David Anglin wrote:
>> On 2022-05-17 9:24 a.m., Helge Deller wrote:
>>> On 5/17/22 15:19, Rolf Eike Beer wrote:
>>>> Am Montag, 16. Mai 2022, 23:49:10 CEST schrieb Helge Deller:
>>>>> On 5/16/22 23:28, Rolf Eike Beer wrote:
>>>>>> Am Montag, 16. Mai 2022, 17:14:47 CEST schrieb John David Anglin:
>>>>>>> Originally, I was convinced that we needed to use tmpalias flushes
>>>>>>> everwhere, for both user and kernel flushes. However, when I modified
>>>>>>> flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
>>>>>>> would crash quite early when booting.
>>>>>>>
>>>>>>> The PDC returns alias values of 0 for the icache and dcache. This
>>>>>>> indicates that either the alias boundary is greater than 16MB or
>>>>>>> equivalent aliasing doesn't work. I modified the tmpalias code to
>>>>>>> make it easy to try alternate boundaries. I tried boundaries up to
>>>>>>> 128MB but still kernel tmpalias flushes didn't work on c8000.
>>>>>>>
>>>>>>> This led me to conclude that tmpalias flushes don't work on PA8800
>>>>>>> and PA8900 machines, and that we needed to flush directly using the
>>>>>>> virtual address of user and kernel pages. This is likely the major
>>>>>>> cause of instability on the c8000 and rp34xx machines.
>>>>>>>
>>>>>>> Flushing user pages requires doing a temporary context switch as we
>>>>>>> have to flush pages that don't belong to the current context. Further,
>>>>>>> we have to deal with pages that aren't present. If a page isn't
>>>>>>> present, the flush instructions fault on every line.
>>>>>>>
>>>>>>> Other code has been rearranged and simplified based on testing. For
>>>>>>> example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
>>>>>>> and flush_cache_dup_mm differ in that flush_cache_mm calls
>>>>>>> purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
>>>>>>> In some implementations, pdc is more efficient than fdc. Based on
>>>>>>> my testing, I don't believe there's any performance benefit on the
>>>>>>> c8000.
>>>>>>>
>>>>>>> V2:
>>>>>>> 1) Add flush_cache_page_check_pte routine.
>>>>>>> 2) Use it in copy_to_user_page and copy_from_user_page.
>>>>>>> 3) flush_anon_page moved to cache.c and updated.
>>>>>>> 4) Changed alias boundary to 64 MB for 64-bit kernels. Revised comment
>>>>>>>
>>>>>>>      regarding alias boundary for PA8800/PA8900 processors.
>>>>>>>
>>>>>>> 5) Removed struct mm_struct * argument from flush_cache_pages.
>>>>>>> 6) Fixed thinko in flush_cache_range. It increased the number of pages
>>>>>>>
>>>>>>>      flushed and slowed performance.
>>>>>>>
>>>>>>> 7) Removed sync changes from pacache.S.
>>>>>>>
>>>>>>> V3:
>>>>>>> 1) copy_to_user_page and copy_from_user_page moved to cache.c to
>>>>>>>
>>>>>>>      improve inlining.
>>>>>>>
>>>>>>> 2) Replaced copy_user_page with copy_user_highpage.
>>>>>>> 3) Fixed cache threshold calculation on 32-bit kernels.
>>>>>>> 4) Don't warn on inequivalent private mappings in flush_dcache_page.
>>>>>>> 5) Return early from mm_total_size if size exceeds
>>>>>>>
>>>>>>>      parisc_cache_flush_threshold.
>>>>>>>
>>>>>>> 6) Disable non-access TLB warning in handle_nadtlb_fault. Warning
>>>>>>>
>>>>>>>      happens occassionally handling flushes for COW faults.
>>>>>>>
>>>>>>> 7) Remove flush_cache_dup_mm.
>>>>>>> 8) Flush entire cache in flush_cache_mm and flush_cache_range on
>>>>>>>
>>>>>>>      processors with aliasing caches. Only flush small cache ranges
>>>>>>>      on machines with PA8800/PA8900 processors.
>>>>>>>
>>>>>>> 9) Tested on rp3440, c8000 and c3750.
>>>>>> Given how long these changelogs are, and how fragile the whole caching is
>>>>>> I
>>>>>> think it is a good idea to split this patch into smaller ones, to improve
>>>>>> readability and being able to bisect it.
>>>>> FWIW, I've done some cleanups to this patch and committed it to my for-next
>>>>> tree. In case it's split up, please use the revised version.
>>>> Why did you modify get_ptep()? Until now it was just moved around in the file,
>>>> and IMHO it becomes less readable because all these needless variables are
>>>> batched up at the start of the function now. The only point I would see in
>>>> moving them all to the front is if there would be no nesting anymore, and
>>>> every condition was inverted:
>>> Dave's original patch did not moved the variables to the beginning.
>>> That change was me - just because checkpatch complained otherwise.
>>>
>>> I agree that it's less readable.
>> The get_ptep code was originally based on the vmalloc_to_page code in vmalloc.c.
>> vmalloc_to_page code.  This code has now changed, so I think get_ptep needs updating.
>> See get_old_pud in mremap.c.  It looks good to me.
> So, what's the consense of this discussion now?
>
> I can easily split out the pr_warn("WARNING").
> Moving the get_ptep() back to the original place seems ok, and I'll keep
> the strange indenting which checkpatch want.
> Is that ok?
It's okay with me.  But maybe we need to add pgd_leaf and pgd_bad checks (etc).  That's
what is concerning me.

Dave
Helge Deller May 17, 2022, 8 p.m. UTC | #17
On 5/17/22 20:28, Rolf Eike Beer wrote:
> Am Dienstag, 17. Mai 2022, 20:11:38 CEST schrieb Helge Deller:
>
>> I can easily split out the pr_warn("WARNING").
>
> Would make sense IMHO.

I split that patch out now.
Dave, can you please check if you are ok with it?
https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=for-next&id=67c35a3b646cc68598ff0bb28de5f8bd7b2e81b3
I used the wording from your other mail.

>> Moving the get_ptep() back to the original place seems ok, and I'll keep
>> the strange indenting which checkpatch want.
>
> If its back at the original place then there is no need to change as
> checkpatch will not complain on unmodified lines.

I meant "back to the place where it was versions before".
So, it has to move anyway now.

> If it needs to be moved and
> changed then I would say do it in it's own patch as well.

I kept it in the way Dave sent it (with the checkpatch fixups I added).

Just pushed a new "for-next" tree at the usual place:
https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/log/?h=for-next

Helge
John David Anglin May 17, 2022, 8:19 p.m. UTC | #18
On 2022-05-17 4:00 p.m., Helge Deller wrote:
> On 5/17/22 20:28, Rolf Eike Beer wrote:
>> Am Dienstag, 17. Mai 2022, 20:11:38 CEST schrieb Helge Deller:
>>
>>> I can easily split out the pr_warn("WARNING").
>> Would make sense IMHO.
> I split that patch out now.
> Dave, can you please check if you are ok with it?
> https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=for-next&id=67c35a3b646cc68598ff0bb28de5f8bd7b2e81b3
> I used the wording from your other mail.
Looks better than what I sent!
>
>>> Moving the get_ptep() back to the original place seems ok, and I'll keep
>>> the strange indenting which checkpatch want.
>> If its back at the original place then there is no need to change as
>> checkpatch will not complain on unmodified lines.
> I meant "back to the place where it was versions before".
> So, it has to move anyway now.
>
>> If it needs to be moved and
>> changed then I would say do it in it's own patch as well.
> I kept it in the way Dave sent it (with the checkpatch fixups I added).
>
> Just pushed a new "for-next" tree at the usual place:
> https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/log/?h=for-next
Will test.

Dave
diff mbox series

Patch

diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index e8b4a03343d3..8331930ac695 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -59,19 +59,8 @@  void flush_dcache_page(struct page *page);
 	flush_kernel_icache_range_asm(s,e); 		\
 } while (0)
 
-#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
-do { \
-	flush_cache_page(vma, vaddr, page_to_pfn(page)); \
-	memcpy(dst, src, len); \
-	flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len); \
-} while (0)
-
-#define copy_from_user_page(vma, page, vaddr, dst, src, len) \
-do { \
-	flush_cache_page(vma, vaddr, page_to_pfn(page)); \
-	memcpy(dst, src, len); \
-} while (0)
-
+void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len);
+void copy_from_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len);
 void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn);
 void flush_cache_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end);
@@ -80,16 +69,7 @@  void flush_cache_range(struct vm_area_struct *vma,
 void flush_dcache_page_asm(unsigned long phys_addr, unsigned long vaddr);
 
 #define ARCH_HAS_FLUSH_ANON_PAGE
-static inline void
-flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
-{
-	if (PageAnon(page)) {
-		flush_tlb_page(vma, vmaddr);
-		preempt_disable();
-		flush_dcache_page_asm(page_to_phys(page), vmaddr);
-		preempt_enable();
-	}
-}
+void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr);
 
 #define ARCH_HAS_FLUSH_ON_KUNMAP
 static inline void kunmap_flush_on_unmap(void *addr)
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 0561568f7b48..20c89ed9d7da 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -26,12 +26,13 @@ 
 #define copy_page(to, from)	copy_page_asm((void *)(to), (void *)(from))
 
 struct page;
+struct vm_area_struct;
 
 void clear_page_asm(void *page);
 void copy_page_asm(void *to, void *from);
 #define clear_user_page(vto, vaddr, page) clear_page_asm(vto)
-void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
-			struct page *pg);
+void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma);
+#define __HAVE_ARCH_COPY_USER_HIGHPAGE
 
 /*
  * These are used to make use of C type-checking..
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ee637c55f2b1..618dfda3549e 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -27,6 +27,7 @@ 
 #include <asm/processor.h>
 #include <asm/sections.h>
 #include <asm/shmparam.h>
+#include <asm/mmu_context.h>
 
 int split_tlb __ro_after_init;
 int dcache_stride __ro_after_init;
@@ -91,7 +92,7 @@  static inline void flush_data_cache(void)
 }
 
 
-/* Virtual address of pfn.  */
+/* Kernel virtual address of pfn.  */
 #define pfn_va(pfn)	__va(PFN_PHYS(pfn))
 
 void
@@ -124,11 +125,13 @@  show_cache_info(struct seq_file *m)
 		cache_info.ic_size/1024 );
 	if (cache_info.dc_loop != 1)
 		snprintf(buf, 32, "%lu-way associative", cache_info.dc_loop);
-	seq_printf(m, "D-cache\t\t: %ld KB (%s%s, %s)\n",
+	seq_printf(m, "D-cache\t\t: %ld KB (%s%s, %s, alias=%d)\n",
 		cache_info.dc_size/1024,
 		(cache_info.dc_conf.cc_wt ? "WT":"WB"),
 		(cache_info.dc_conf.cc_sh ? ", shared I/D":""),
-		((cache_info.dc_loop == 1) ? "direct mapped" : buf));
+		((cache_info.dc_loop == 1) ? "direct mapped" : buf),
+		cache_info.dc_conf.cc_alias
+	);
 	seq_printf(m, "ITLB entries\t: %ld\n" "DTLB entries\t: %ld%s\n",
 		cache_info.it_size,
 		cache_info.dt_size,
@@ -324,25 +327,77 @@  __flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
 	preempt_enable();
 }
 
-static inline void
-__purge_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
-		   unsigned long physaddr)
+static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmaddr)
 {
-	if (!static_branch_likely(&parisc_has_cache))
-		return;
+	unsigned long flags, space, pgd, prot;
+#ifdef CONFIG_TLB_PTLOCK
+	unsigned long pgd_lock;
+#endif
+
+	vmaddr &= PAGE_MASK;
+
 	preempt_disable();
-	purge_dcache_page_asm(physaddr, vmaddr);
+
+	/* Set context for flush */
+	local_irq_save(flags);
+	prot = mfctl(8);
+	space = mfsp(SR_USER);
+	pgd = mfctl(25);
+#ifdef CONFIG_TLB_PTLOCK
+	pgd_lock = mfctl(28);
+#endif
+	switch_mm_irqs_off(NULL, vma->vm_mm, NULL);
+	local_irq_restore(flags);
+
+	flush_user_dcache_range_asm(vmaddr, vmaddr + PAGE_SIZE);
 	if (vma->vm_flags & VM_EXEC)
-		flush_icache_page_asm(physaddr, vmaddr);
+		flush_user_icache_range_asm(vmaddr, vmaddr + PAGE_SIZE);
+	flush_tlb_page(vma, vmaddr);
+
+	/* Restore previous context */
+	local_irq_save(flags);
+#ifdef CONFIG_TLB_PTLOCK
+	mtctl(pgd_lock, 28);
+#endif
+	mtctl(pgd, 25);
+	mtsp(space, SR_USER);
+	mtctl(prot, 8);
+	local_irq_restore(flags);
+
 	preempt_enable();
 }
 
+static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
+{
+	pte_t *ptep = NULL;
+	pgd_t *pgd = mm->pgd;
+
+	if (!pgd_none(*pgd)) {
+		p4d_t *p4d = p4d_offset(pgd, addr);
+		if (!p4d_none(*p4d)) {
+			pud_t *pud = pud_offset(p4d, addr);
+			if (!pud_none(*pud)) {
+				pmd_t *pmd = pmd_offset(pud, addr);
+				if (!pmd_none(*pmd))
+					ptep = pte_offset_map(pmd, addr);
+			}
+		}
+	}
+	return ptep;
+}
+
+static inline bool pte_flush(pte_t pte)
+{
+	return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_NO_CACHE)) == (_PAGE_PRESENT | _PAGE_ACCESSED);
+}
+
 void flush_dcache_page(struct page *page)
 {
 	struct address_space *mapping = page_mapping_file(page);
 	struct vm_area_struct *mpnt;
 	unsigned long offset;
 	unsigned long addr, old_addr = 0;
+	unsigned long count = 0;
 	pgoff_t pgoff;
 
 	if (mapping && !mapping_mapped(mapping)) {
@@ -357,33 +412,48 @@  void flush_dcache_page(struct page *page)
 
 	pgoff = page->index;
 
-	/* We have carefully arranged in arch_get_unmapped_area() that
+	/*
+	 * We have carefully arranged in arch_get_unmapped_area() that
 	 * *any* mappings of a file are always congruently mapped (whether
 	 * declared as MAP_PRIVATE or MAP_SHARED), so we only need
-	 * to flush one address here for them all to become coherent */
-
+	 * to flush one address here for them all to become coherent
+	 * on machines that support equivalent aliasing
+	 */
 	flush_dcache_mmap_lock(mapping);
 	vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
 		offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
 		addr = mpnt->vm_start + offset;
-
-		/* The TLB is the engine of coherence on parisc: The
-		 * CPU is entitled to speculate any page with a TLB
-		 * mapping, so here we kill the mapping then flush the
-		 * page along a special flush only alias mapping.
-		 * This guarantees that the page is no-longer in the
-		 * cache for any process and nor may it be
-		 * speculatively read in (until the user or kernel
-		 * specifically accesses it, of course) */
-
-		flush_tlb_page(mpnt, addr);
-		if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1))
-				      != (addr & (SHM_COLOUR - 1))) {
-			__flush_cache_page(mpnt, addr, page_to_phys(page));
-			if (parisc_requires_coherency() && old_addr)
-				printk(KERN_ERR "INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n", old_addr, addr, mpnt->vm_file);
-			old_addr = addr;
+		if (parisc_requires_coherency()) {
+			pte_t *ptep = get_ptep(mpnt->vm_mm, addr);
+			if (ptep && pte_flush(*ptep))
+				flush_user_cache_page(mpnt, addr);
+		} else {
+			/*
+			 * The TLB is the engine of coherence on parisc:
+			 * The CPU is entitled to speculate any page
+			 * with a TLB mapping, so here we kill the
+			 * mapping then flush the page along a special
+			 * flush only alias mapping. This guarantees that
+			 * the page is no-longer in the cache for any
+			 * process and nor may it be speculatively read
+			 * in (until the user or kernel specifically
+			 * accesses it, of course)
+			 */
+			flush_tlb_page(mpnt, addr);
+			if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1)) != (addr & (SHM_COLOUR - 1))) {
+				__flush_cache_page(mpnt, addr, page_to_phys(page));
+				/*
+				 * Software is allowed to have any number
+				 * of private mappings to a page.
+				 */
+				if (!(mpnt->vm_flags & VM_SHARED))
+					continue;
+				if (old_addr)
+					printk(KERN_ERR "INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n", old_addr, addr, mpnt->vm_file);
+				old_addr = addr;
+			}
 		}
+		BUG_ON(++count > 4096);
 	}
 	flush_dcache_mmap_unlock(mapping);
 }
@@ -403,7 +473,7 @@  void __init parisc_setup_cache_timing(void)
 {
 	unsigned long rangetime, alltime;
 	unsigned long size;
-	unsigned long threshold;
+	unsigned long threshold, threshold2;
 
 	alltime = mfctl(16);
 	flush_data_cache();
@@ -417,11 +487,16 @@  void __init parisc_setup_cache_timing(void)
 	printk(KERN_DEBUG "Whole cache flush %lu cycles, flushing %lu bytes %lu cycles\n",
 		alltime, size, rangetime);
 
-	threshold = L1_CACHE_ALIGN(size * alltime / rangetime);
-	if (threshold > cache_info.dc_size)
-		threshold = cache_info.dc_size;
-	if (threshold)
-		parisc_cache_flush_threshold = threshold;
+	threshold = L1_CACHE_ALIGN((unsigned long)((uint64_t)size * alltime / rangetime));
+	printk(KERN_INFO "Calculated flush threshold is %lu KiB\n",
+		threshold/1024);
+
+	/*
+	 * The threshold computed above isn't very reliable. The following
+	 * heuristic works reasonably well on c8000/rp3440.
+	 */
+	threshold2 = cache_info.dc_size * num_online_cpus();
+	parisc_cache_flush_threshold = threshold2;
 	printk(KERN_INFO "Cache flush threshold set to %lu KiB\n",
 		parisc_cache_flush_threshold/1024);
 
@@ -477,19 +552,44 @@  void flush_kernel_dcache_page_addr(void *addr)
 }
 EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
 
-void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
-	struct page *pg)
+static void flush_cache_page_if_present(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn)
 {
-       /* Copy using kernel mapping.  No coherency is needed (all in
-	  kunmap) for the `to' page.  However, the `from' page needs to
-	  be flushed through a mapping equivalent to the user mapping
-	  before it can be accessed through the kernel mapping. */
-	preempt_disable();
-	flush_dcache_page_asm(__pa(vfrom), vaddr);
-	copy_page_asm(vto, vfrom);
-	preempt_enable();
+	pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+
+	/*
+	 * The pte check is racy and sometimes the flush will trigger
+	 * a non-access TLB miss. Hopefully, the page has already been
+	 * flushed.
+	 */
+	if (ptep && pte_flush(*ptep))
+		flush_cache_page(vma, vmaddr, pfn);
+}
+
+void copy_user_highpage(struct page *to, struct page *from,
+	unsigned long vaddr, struct vm_area_struct *vma)
+{
+	void *kto, *kfrom;
+
+	kfrom = kmap_local_page(from);
+	kto = kmap_local_page(to);
+	flush_cache_page_if_present(vma, vaddr, page_to_pfn(from));
+	copy_page_asm(kto, kfrom);
+	kunmap_local(kto);
+	kunmap_local(kfrom);
+}
+
+void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len)
+{
+	flush_cache_page_if_present(vma, user_vaddr, page_to_pfn(page));
+	memcpy(dst, src, len);
+	flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len);
+}
+
+void copy_from_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len)
+{
+	flush_cache_page_if_present(vma, user_vaddr, page_to_pfn(page));
+	memcpy(dst, src, len);
 }
-EXPORT_SYMBOL(copy_user_page);
 
 /* __flush_tlb_range()
  *
@@ -520,91 +620,99 @@  int __flush_tlb_range(unsigned long sid, unsigned long start,
 	return 0;
 }
 
-static inline unsigned long mm_total_size(struct mm_struct *mm)
+static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
-	struct vm_area_struct *vma;
-	unsigned long usize = 0;
-
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		usize += vma->vm_end - vma->vm_start;
-	return usize;
-}
-
-static inline pte_t *get_ptep(pgd_t *pgd, unsigned long addr)
-{
-	pte_t *ptep = NULL;
+	unsigned long addr, pfn;
+	pte_t *ptep;
 
-	if (!pgd_none(*pgd)) {
-		p4d_t *p4d = p4d_offset(pgd, addr);
-		if (!p4d_none(*p4d)) {
-			pud_t *pud = pud_offset(p4d, addr);
-			if (!pud_none(*pud)) {
-				pmd_t *pmd = pmd_offset(pud, addr);
-				if (!pmd_none(*pmd))
-					ptep = pte_offset_map(pmd, addr);
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		/*
+		 * The vma can contain pages that aren't present. Although
+		 * the pte search is expensive, we need the pte to find the
+		 * page pfn and to check whether the page should be flushed.
+		 */
+		ptep = get_ptep(vma->vm_mm, addr);
+		if (ptep && pte_flush(*ptep)) {
+			if (parisc_requires_coherency()) {
+				flush_user_cache_page(vma, addr);
+			} else {
+				pfn = pte_pfn(*ptep);
+				BUG_ON(!pfn_valid(pfn));
+				__flush_cache_page(vma, addr, PFN_PHYS(pfn));
 			}
 		}
 	}
-	return ptep;
 }
 
-static void flush_cache_pages(struct vm_area_struct *vma, struct mm_struct *mm,
-			      unsigned long start, unsigned long end)
+static inline unsigned long mm_total_size(struct mm_struct *mm)
 {
-	unsigned long addr, pfn;
-	pte_t *ptep;
+	struct vm_area_struct *vma;
+	unsigned long usize = 0;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		ptep = get_ptep(mm->pgd, addr);
-		if (ptep) {
-			pfn = pte_pfn(*ptep);
-			flush_cache_page(vma, addr, pfn);
-		}
-	}
+	for (vma = mm->mmap; vma && usize < parisc_cache_flush_threshold; vma = vma->vm_next)
+		usize += vma->vm_end - vma->vm_start;
+	return usize;
 }
 
 void flush_cache_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 
-	/* Flushing the whole cache on each cpu takes forever on
-	   rp3440, etc.  So, avoid it if the mm isn't too big.  */
-	if ((!IS_ENABLED(CONFIG_SMP) || !arch_irqs_disabled()) &&
-	    mm_total_size(mm) >= parisc_cache_flush_threshold) {
-		if (mm->context.space_id)
-			flush_tlb_all();
+	/*
+	 * Flushing the whole cache on each cpu takes forever on
+	 * rp3440, etc. So, avoid it if the mm isn't too big.
+	 *
+	 * Note that we must flush the entire cache on machines
+	 * with aliasing caches to prevent random segmentation
+	 * faults.
+	 */
+	if (!parisc_requires_coherency()
+	    ||  mm_total_size(mm) >= parisc_cache_flush_threshold) {
+		BUG_ON(IS_ENABLED(CONFIG_SMP) && arch_irqs_disabled());
+		flush_tlb_all();
 		flush_cache_all();
 		return;
 	}
 
+	/* Flush mm */
 	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		flush_cache_pages(vma, mm, vma->vm_start, vma->vm_end);
+		flush_cache_pages(vma, vma->vm_start, vma->vm_end);
 }
 
-void flush_cache_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end)
+void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
 {
-	if ((!IS_ENABLED(CONFIG_SMP) || !arch_irqs_disabled()) &&
-	    end - start >= parisc_cache_flush_threshold) {
-		if (vma->vm_mm->context.space_id)
-			flush_tlb_range(vma, start, end);
+	if (!parisc_requires_coherency()
+	    || end - start >= parisc_cache_flush_threshold) {
+		BUG_ON(IS_ENABLED(CONFIG_SMP) && arch_irqs_disabled());
+		flush_tlb_range(vma, start, end);
 		flush_cache_all();
 		return;
 	}
 
-	flush_cache_pages(vma, vma->vm_mm, start, end);
+	flush_cache_pages(vma, start, end);
 }
 
-void
-flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn)
+void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn)
 {
-	if (pfn_valid(pfn)) {
-		if (likely(vma->vm_mm->context.space_id)) {
-			flush_tlb_page(vma, vmaddr);
-			__flush_cache_page(vma, vmaddr, PFN_PHYS(pfn));
-		} else {
-			__purge_cache_page(vma, vmaddr, PFN_PHYS(pfn));
+	BUG_ON(!pfn_valid(pfn));
+	if (parisc_requires_coherency())
+		flush_user_cache_page(vma, vmaddr);
+	else
+		__flush_cache_page(vma, vmaddr, PFN_PHYS(pfn));
+}
+
+void flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
+{
+	if (PageAnon(page)) {
+		if (parisc_requires_coherency()) {
+			flush_user_cache_page(vma, vmaddr);
+			return;
 		}
+
+		flush_tlb_page(vma, vmaddr);
+		preempt_disable();
+		flush_dcache_page_asm(page_to_phys(page), vmaddr);
+		preempt_enable();
 	}
 }
 
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index f114e102aaf2..ca49765784fc 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -22,6 +22,8 @@ 
 
 #include <asm/traps.h>
 
+/* #define DEBUG_NATLB 1 */
+
 /* Various important other fields */
 #define bit22set(x)		(x & 0x00000200)
 #define bits23_25set(x)		(x & 0x000001c0)
@@ -450,10 +452,12 @@  handle_nadtlb_fault(struct pt_regs *regs)
 		fallthrough;
 	case 0x380:
 		/* PDC and FIC instructions */
+#ifdef DEBUG_NATLB
 		if (printk_ratelimit()) {
-			pr_warn("BUG: nullifying cache flush/purge instruction\n");
+			pr_warn("WARNING: nullifying cache flush/purge instruction\n");
 			show_regs(regs);
 		}
+#endif
 		if (insn & 0x20) {
 			/* Base modification */
 			breg = (insn >> 21) & 0x1f;