diff mbox

[v1,2/4] ioremap: Invalidate TLB after huge mappings

Message ID 1521017305-28518-3-git-send-email-cpandya@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Chintan Pandya March 14, 2018, 8:48 a.m. UTC
If huge mappings are enabled, they can override
valid intermediate previous mappings. Some MMU
can speculatively pre-fetch these intermediate
entries even after unmap. That's because unmap
will clear only last level entries in page table
keeping intermediate (pud/pmd) entries still valid.

This can potentially lead to stale TLB entries
which needs invalidation after map.

Some more info: https://lkml.org/lkml/2017/12/23/3

There is one noted case for ARM64 where such stale
TLB entries causes 3rd level translation fault even
after correct (huge) mapping is available.

See the case below (reproduced locally with tests),

[17505.330123] Unable to handle kernel paging request at virtual address ffffff801ae00000
[17505.338100] pgd = ffffff800a761000
[17505.341566] [ffffff801ae00000] *pgd=000000017e1be003, *pud=000000017e1be003, *pmd=00e8000098000f05
[17505.350704] ------------[ cut here ]------------
[17505.355362] Kernel BUG at ffffff8008238c30 [verbose debug info unavailable]
[17505.362375] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[17505.367996] Modules linked in:
[17505.371114] CPU: 6 PID: 488 Comm: chintan-ioremap Not tainted 4.9.81+ #160
[17505.378039] Hardware name: Qualcomm Technologies, Inc. SDM845 v1 MTP (DT)
[17505.384885] task: ffffffc0e3e61180 task.stack: ffffffc0e3e70000
[17505.390868] PC is at io_remap_test+0x2e0/0x444
[17505.395352] LR is at io_remap_test+0x2d0/0x444
[17505.399835] pc : [<ffffff8008238c30>] lr : [<ffffff8008238c20>] pstate: 60c00005
[17505.407282] sp : ffffffc0e3e73d70
[17505.410624] x29: ffffffc0e3e73d70 x28: ffffff801ae00008
[17505.416031] x27: ffffff801ae00010 x26: ffffff801ae00018
[17505.421436] x25: ffffff801ae00020 x24: ffffff801adfffe0
[17505.426840] x23: ffffff801adfffe8 x22: ffffff801adffff0
[17505.432244] x21: ffffff801adffff8 x20: ffffff801ae00000
[17505.437648] x19: 0000000000000005 x18: 0000000000000000
[17505.443052] x17: 00000000b3409452 x16: 00000000923da470
[17505.448456] x15: 0000000071c9763c x14: 00000000a15658fa
[17505.453860] x13: 000000005cae96bf x12: 00000000e6d5c44a
[17505.459264] x11: 0140000000000000 x10: ffffff80099a1000
[17505.464668] x9 : 0000000000000000 x8 : ffffffc0e3e73d68
[17505.470072] x7 : ffffff80099d3220 x6 : 0000000000000015
[17505.475476] x5 : 00000c00004ad32a x4 : 000000000000000a
[17505.480880] x3 : 000000000682aaab x2 : 0000001345c2ad2e
[17505.486284] x1 : 7d78d61de56639ba x0 : 0000000000000001

Hence, invalidate once we override pmd/pud with huge
mappings.

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
---
 lib/ioremap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Mark Rutland March 14, 2018, 10:48 a.m. UTC | #1
On Wed, Mar 14, 2018 at 02:18:23PM +0530, Chintan Pandya wrote:
> If huge mappings are enabled, they can override
> valid intermediate previous mappings. Some MMU
> can speculatively pre-fetch these intermediate
> entries even after unmap. That's because unmap
> will clear only last level entries in page table
> keeping intermediate (pud/pmd) entries still valid.
> 
> This can potentially lead to stale TLB entries
> which needs invalidation after map.
> 
> Some more info: https://lkml.org/lkml/2017/12/23/3
> 
> There is one noted case for ARM64 where such stale
> TLB entries causes 3rd level translation fault even
> after correct (huge) mapping is available.

> Hence, invalidate once we override pmd/pud with huge
> mappings.

>  static int __read_mostly ioremap_p4d_capable;
> @@ -92,8 +93,10 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>  		if (ioremap_pmd_enabled() &&
>  		    ((next - addr) == PMD_SIZE) &&
>  		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> -			if (pmd_set_huge(pmd, phys_addr + addr, prot))
> +			if (pmd_set_huge(pmd, phys_addr + addr, prot)) {
> +				flush_tlb_pgtable(&init_mm, addr);
>  				continue;
> +			}
>  		}
>  
>  		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
> @@ -118,8 +121,10 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
>  		if (ioremap_pud_enabled() &&
>  		    ((next - addr) == PUD_SIZE) &&
>  		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
> -			if (pud_set_huge(pud, phys_addr + addr, prot))
> +			if (pud_set_huge(pud, phys_addr + addr, prot)) {
> +				flush_tlb_pgtable(&init_mm, addr);
>  				continue;
> +			}
>  		}

As has been noted in previous threads, the ARM architecture requires a
Break-Before-Make sequence when changing an entry from a table to a
block, as is the case here.

The means the necessary sequence is:

	1. Make the entry invalid
	2. Invalidate relevant TLB entries
	3. Write the new entry

Whereas above, the sequence is

	1. Write the new entry
	2. invalidate relevant TLB entries

Which is insufficient, and will lead to a number of problems.

Therefore, NAK to this patch.

Please read up on the Break-Before-Make requirements in the ARM ARM.

Thanks,
Mark.
Chintan Pandya March 14, 2018, 11:20 a.m. UTC | #2
On 3/14/2018 4:18 PM, Mark Rutland wrote:
> On Wed, Mar 14, 2018 at 02:18:23PM +0530, Chintan Pandya wrote:
>> If huge mappings are enabled, they can override
>> valid intermediate previous mappings. Some MMU
>> can speculatively pre-fetch these intermediate
>> entries even after unmap. That's because unmap
>> will clear only last level entries in page table
>> keeping intermediate (pud/pmd) entries still valid.
>>
>> This can potentially lead to stale TLB entries
>> which needs invalidation after map.
>>
>> Some more info: https://lkml.org/lkml/2017/12/23/3
>>
>> There is one noted case for ARM64 where such stale
>> TLB entries causes 3rd level translation fault even
>> after correct (huge) mapping is available.
> 
>> Hence, invalidate once we override pmd/pud with huge
>> mappings.
> 
>>   static int __read_mostly ioremap_p4d_capable;
>> @@ -92,8 +93,10 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>>   		if (ioremap_pmd_enabled() &&
>>   		    ((next - addr) == PMD_SIZE) &&
>>   		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
>> -			if (pmd_set_huge(pmd, phys_addr + addr, prot))
>> +			if (pmd_set_huge(pmd, phys_addr + addr, prot)) {
>> +				flush_tlb_pgtable(&init_mm, addr);
>>   				continue;
>> +			}
>>   		}
>>   
>>   		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
>> @@ -118,8 +121,10 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
>>   		if (ioremap_pud_enabled() &&
>>   		    ((next - addr) == PUD_SIZE) &&
>>   		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
>> -			if (pud_set_huge(pud, phys_addr + addr, prot))
>> +			if (pud_set_huge(pud, phys_addr + addr, prot)) {
>> +				flush_tlb_pgtable(&init_mm, addr);
>>   				continue;
>> +			}
>>   		}
> 
> As has been noted in previous threads, the ARM architecture requires a
> Break-Before-Make sequence when changing an entry from a table to a
> block, as is the case here.
> 
> The means the necessary sequence is:
> 
> 	1. Make the entry invalid
> 	2. Invalidate relevant TLB entries
> 	3. Write the new entry
> 
We do this for PTEs. I don't see this applicable to PMDs. Because,

1) To mark any PMD invalid, we need to be sure that next level page
    table (I mean all the 512 PTEs) should be zero. That requires us
    to scan entire last level page. A big perf hit !
2) We need to perform step 1 for every unmap as we never know which
    unmap will make last level page table empty.

Moreover, problem comes only when 4K mapping was followed by 2M
mapping. In all other cases, retaining valid PMD has obvious perf
gain. That's what walk-cache is supposed to be introduced for.

So, I think to touch only problematic case and fix it with TLB
invalidate.

> Whereas above, the sequence is
> 
> 	1. Write the new entry
> 	2. invalidate relevant TLB entries
> 
> Which is insufficient, and will lead to a number of problems.
I couldn't think of new problems with this approach. Could you share
any problematic scenarios ?

Also, my test-case runs fine with these patches for 10+ hours.

> 
> Therefore, NAK to this patch.
> 
> Please read up on the Break-Before-Make requirements in the ARM ARM.
Sure, will get more from here.

> 
> Thanks,
> Mark.
> 
Thanks for the review Mark.

Chintan
Mark Rutland March 14, 2018, 11:48 a.m. UTC | #3
On Wed, Mar 14, 2018 at 04:50:35PM +0530, Chintan Pandya wrote:
> On 3/14/2018 4:18 PM, Mark Rutland wrote:
> > On Wed, Mar 14, 2018 at 02:18:23PM +0530, Chintan Pandya wrote:
> > As has been noted in previous threads, the ARM architecture requires a
> > Break-Before-Make sequence when changing an entry from a table to a
> > block, as is the case here.
> > 
> > The means the necessary sequence is:
> > 
> > 	1. Make the entry invalid
> > 	2. Invalidate relevant TLB entries
> > 	3. Write the new entry
> > 
> We do this for PTEs. I don't see this applicable to PMDs.

The architecture requires this for *all* levels of page table, when
certain changes are made. Switching an entry from a table to block (or
vice versa) is one of those changes, and this definitely applies to
PMDs.

> Because,
> 
> 1) To mark any PMD invalid, we need to be sure that next level page
>    table (I mean all the 512 PTEs) should be zero. That requires us
>    to scan entire last level page. A big perf hit !

This is in ioremap code. Under what workload does this constitute a perf
hit?

Regardless, so long as we mark the pmd entry invalid before the TLB
invalidation, we don't need to touch the next level table at all. We
just require a sequence like:

	pmd_clear(*pmdp);
	flush_tlb_kernel_range(pmd_start_addr, pmd_end_addr);
	pmd_set_huge(*pmdp, phys, prot);


> 2) We need to perform step 1 for every unmap as we never know which
>    unmap will make last level page table empty.

Sorry, I don't follow. Could you elaborate on the problem?

> Moreover, problem comes only when 4K mapping was followed by 2M
> mapping. In all other cases, retaining valid PMD has obvious perf
> gain. That's what walk-cache is supposed to be introduced for.

Retaining a valid PMD in the TLB that *differs* from a valid PMD in the
page tables is a big problem.

The architecture requires BBM, as this permits CPUs to allocate PMDs
into TLBs at *any* time, even if there's already PMD in the TLB for a
given address.

Thus, CPUs can allocate *both* valid PMDs into the TLBs. When this
happens, a TLB lookup can:

1) return either of the PMDs.

2) raise a TLB conflict abort.

3) return an amalgamation of the two entries (e.g. provide an erroneous
   address).

Note that (3) is particularly scary:

* The CPU could raise an SError if the amalgamated entry is junk.

* If a memory access hits an amalgamated entry, it may use the wrong
  physical address, attributes, or permissions, resulting in a number of
  potential problems.

* If the amalgamated entry looks like a partial walk, the TLB might try
  to perform a walk starting at the physical address in the amalgamated
  entry. This would cause page table walks to access bogus addresses,
  allocating junk into TLBs, and may result in SErrors or other aborts.

> > Whereas above, the sequence is
> > 
> > 	1. Write the new entry
> > 	2. invalidate relevant TLB entries
> > 
> > Which is insufficient, and will lead to a number of problems.
> I couldn't think of new problems with this approach. Could you share
> any problematic scenarios ?

Please see above.

> Also, my test-case runs fine with these patches for 10+ hours.

While this may happen to work on particular platforms, it is not
guaranteed per the architecture, and will fail on others.

Thanks,
Mark.
diff mbox

Patch

diff --git a/lib/ioremap.c b/lib/ioremap.c
index b808a39..c1e1341 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -13,6 +13,7 @@ 
 #include <linux/export.h>
 #include <asm/cacheflush.h>
 #include <asm/pgtable.h>
+#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 static int __read_mostly ioremap_p4d_capable;
@@ -92,8 +93,10 @@  static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 		if (ioremap_pmd_enabled() &&
 		    ((next - addr) == PMD_SIZE) &&
 		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
-			if (pmd_set_huge(pmd, phys_addr + addr, prot))
+			if (pmd_set_huge(pmd, phys_addr + addr, prot)) {
+				flush_tlb_pgtable(&init_mm, addr);
 				continue;
+			}
 		}
 
 		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
@@ -118,8 +121,10 @@  static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 		if (ioremap_pud_enabled() &&
 		    ((next - addr) == PUD_SIZE) &&
 		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
-			if (pud_set_huge(pud, phys_addr + addr, prot))
+			if (pud_set_huge(pud, phys_addr + addr, prot)) {
+				flush_tlb_pgtable(&init_mm, addr);
 				continue;
+			}
 		}
 
 		if (ioremap_pmd_range(pud, addr, next, phys_addr + addr, prot))