
[5/6] ARM: re-implement physical address space switching

Message ID E1YfmY6-0006X3-1N@rmk-PC.arm.linux.org.uk
State New, archived

Commit Message

Russell King April 8, 2015, 9:45 a.m. UTC
Re-implement the physical address space switching to be architecturally
complaint.  This involves flushing the caches, disabling the MMU, and
only then updating the page tables.  Once that is complete, the system
can be brought back up again.

Since we disable the MMU, we need to do the update in assembly code.
Luckily, the entries which need updating are fairly trivial, and are
all setup by the early assembly code.  We can merely adjust each entry
by the delta required.

Not only does htis fix the code to be architecturally compliant, but it
fixes a couple of bugs too:

1. The original code would only ever update the first L2 entry covering
   a fraction of the kernel; the remainder were left untouched.
2. The L2 entries covering the DTB blob were likewise untouched.

This solution fixes up all entries.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
---
 arch/arm/mm/Kconfig        |   4 ++
 arch/arm/mm/Makefile       |   1 +
 arch/arm/mm/mmu.c          | 119 +++++++++++++--------------------------------
 arch/arm/mm/pv-fixup-asm.S |  88 +++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 84 deletions(-)
 create mode 100644 arch/arm/mm/pv-fixup-asm.S

Comments

Thomas Petazzoni April 8, 2015, 2:34 p.m. UTC | #1
Dear Russell King,

On Wed, 08 Apr 2015 10:45:30 +0100, Russell King wrote:
> Re-implement the physical address space switching to be architecturally
> complaint.  This involves flushing the caches, disabling the MMU, and

Nit: s/complaint/compliant/.

But since I'm not going to write an e-mail just for such a small nit,
here is a hopefully more interesting question: do you think it would
be possible to use this logic for my Armada 370/XP/375/38x case, where
I need some specific page table configuration to be able to use HW I/O
coherency in non-SMP configurations?

If you remember the discussion we had, the primary issue is that the
page tables are built in arch/arm/kernel/head.S, before we even know
which SoC we are running on (and the CPU ID available from CP15 is not
sufficient to determine whether the SoC supports HW I/O coherency). If
we can get into the SoC-specific code, and at this point decide whether
and how the page tables should be rebuilt (which is apparently what
happens in Keystone), then it would also be useful for our case, I
believe.

Thanks,

Thomas
Santosh Shilimkar April 8, 2015, 5:27 p.m. UTC | #2
On 4/8/2015 7:34 AM, Thomas Petazzoni wrote:
> Dear Russell King,
>
> On Wed, 08 Apr 2015 10:45:30 +0100, Russell King wrote:
>> Re-implement the physical address space switching to be architecturally
>> complaint.  This involves flushing the caches, disabling the MMU, and
>
> Nit: s/complaint/compliant/.
>
> But since I'm not going to write an e-mail just for such a small nit,
> here is a hopefully more interesting question: do you think it would
> be possible to use this logic for my Armada 370/XP/375/38x case, where
> I need some specific page table configuration to be able to use HW I/O
> coherency in non-SMP configurations?
>
> If you remember the discussion we had, the primary issue is that the
> page tables are built in arch/arm/kernel/head.S, before we even know
> which SoC we are running on (and the CPU ID available from CP15 is not
> sufficient to determine whether the SoC supports HW I/O coherency). If
> we can get into the SoC-specific code, and at this point decide whether
> and how the page tables should be rebuilt (which is apparently what
> happens in Keystone), then it would also be useful for our case, I
> believe.
>
Well, Keystone does switch the page tables to use entirely different
memory, but what you are talking about is controlling the page table
attributes for certain pages (memory areas). The head.S page tables
are just for boot-up; the actual page table creation happens in
paging_init().

Not sure if you just need a special MT_* area which could take care of
your case, but I am missing details, so I'm not sure.

Regards,
Santosh
Mark Rutland April 8, 2015, 5:36 p.m. UTC | #3
Hi Russell,

On Wed, Apr 08, 2015 at 10:45:30AM +0100, Russell King wrote:
> Re-implement the physical address space switching to be architecturally
> complaint.  This involves flushing the caches, disabling the MMU, and
> only then updating the page tables.  Once that is complete, the system
> can be brought back up again.
> 
> Since we disable the MMU, we need to do the update in assembly code.
> Luckily, the entries which need updating are fairly trivial, and are
> all setup by the early assembly code.  We can merely adjust each entry
> by the delta required.
> 
> Not only does htis fix the code to be architecturally compliant, but it
> fixes a couple of bugs too:

Nit: s/htis/this/

[...]

> -       /*
> -        * Ensure that the above updates are flushed out of the cache.
> -        * This is not strictly correct; on a system where the caches
> -        * are coherent with each other, but the MMU page table walks
> -        * may not be coherent, flush_cache_all() may be a no-op, and
> -        * this will fail.
> +        * We are changing not only the virtual to physical mapping, but
> +        * also the physical addresses used to access memory.  We need
> +        * to flush all levels of cache in the system with caching
> +        * disabled to ensure that all data is written back.  We do this
> +        * with caching and write buffering disabled to ensure that
> +        * nothing is speculatively prefetched.
>          */
> +       cr = get_cr();
> +       set_cr(cr & ~(CR_I | CR_C | CR_W));

SCTLR[3] (CR_W) is RAO/SBOP in VMSAv7. I don't think we should clear it
here for ARMv7.

To the best of my knowledge, the page table walkers aren't affected by
SCTLR.C, and use the attributes in TTBR{0,1} or TTBCR when translation
is active (i.e. when SCTLR.M is set). So at this point they can make
cacheable accesses to the page tables (and allocate into the caches) in
the background...

>         flush_cache_all();

...meaning this flush may not leave the caches empty, at least for
memory regions containing page tables. Stale lines could mask updates
made in lpae_pgtables_remap_asm.

I think that the cache flush needs to be performed after both
SCTLR.{C,M} are cleared in lpae_pgtables_remap_asm, just before the page
table updates. So long as the relevant pieces of kernel text are
initially clean to the PoC, we shouldn't need to flush anything
beforehand.
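
Roughly what I have in mind, as an untested sketch (v7_flush_dcache_all
is the existing set/way flush routine in cache-v7.S; register usage and
preservation around the call are glossed over, and this assumes we're
already running from the idmap):

	mrc	p15, 0, r8, c1, c0, 0		@ read SCTLR
	bic	ip, r8, #(CR_C | CR_M)		@ caches off, MMU (and walks) off
	mcr	p15, 0, ip, c1, c0, 0
	isb					@ synchronize the SCTLR write
	bl	v7_flush_dcache_all		@ flush while nothing can allocate
	@ ... only now update the page table entries ...

That way the walkers cannot allocate lines behind the flush's back.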


> 
>         /*
> -        * Re-write the TTBR values to point them at the high physical
> -        * alias of the page tables.  We expect __va() will work on
> -        * cpu_get_pgd(), which returns the value of TTBR0.
> +        * Fixup the page tables - this must be in the idmap region as
> +        * we need to disable the MMU to do this safely, and hence it
> +        * needs to be assembly.  It's fairly simple, as we're using the
> +        * temporary tables setup by the initial assembly code.
>          */
> -       cpu_switch_mm(pgd0, &init_mm);
> -       cpu_set_ttbr(1, __pa(pgd0) + TTBR1_OFFSET);
> +       lpae_pgtables_remap(offset, pa_pgd, boot_data);
> 
> -       /* Finally flush any stale TLB values. */
> -       local_flush_bp_all();
> -       local_flush_tlb_all();
> +       /* Re-enable the caches */
> +       set_cr(cr);

That being the case there's no reason to restore M and C separately on
the return path either.

[...]

> +ENTRY(lpae_pgtables_remap_asm)
> +       stmfd   sp!, {r4-r8, lr}
> +
> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> +       mcr     p15, 0, ip, c1, c0, 0
> +       dsb
> +       isb

Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
point to an idmap/physical address)?

I don't see why we need a DSB after the write to the SCTLR.

[...]

> +       dsb
> +       isb
> +
> +       mcr     p15, 0, r8, c1, c0, 0           @ re-enable MMU
> +       dsb
> +       isb

Similarly, isn't the last DSB redundant?

Thanks,
Mark.
Russell King - ARM Linux April 8, 2015, 5:55 p.m. UTC | #4
On Wed, Apr 08, 2015 at 06:36:56PM +0100, Mark Rutland wrote:
> Hi Russell,
> 
> On Wed, Apr 08, 2015 at 10:45:30AM +0100, Russell King wrote:
> > -       /*
> > -        * Ensure that the above updates are flushed out of the cache.
> > -        * This is not strictly correct; on a system where the caches
> > -        * are coherent with each other, but the MMU page table walks
> > -        * may not be coherent, flush_cache_all() may be a no-op, and
> > -        * this will fail.
> > +        * We are changing not only the virtual to physical mapping, but
> > +        * also the physical addresses used to access memory.  We need
> > +        * to flush all levels of cache in the system with caching
> > +        * disabled to ensure that all data is written back.  We do this
> > +        * with caching and write buffering disabled to ensure that
> > +        * nothing is speculatively prefetched.
> >          */
> > +       cr = get_cr();
> > +       set_cr(cr & ~(CR_I | CR_C | CR_W));
> 
> SCTLR[3] (CR_W) is RAO/SBOP in VMSAv7. I don't think we should clear it
> here for ARMv7.

I was in two minds about that - I guess as we're expecting to only run
this on ARMv7 CPUs, we can omit clearing the CR_W, but I'd need to add
a comment saying that it's ARMv7 only.

> To the best of my knowledge, the page table walkers aren't affected by
> SCTLR.C, and use the attributes in TTBR{0,1} or TTBCR when translation
> is active (i.e. when SCTLR.M is set). So at this point they can make
> cacheable accesses to the page tables (and allocate into the caches) in
> the background...

We had better clear those bits too then.

> I think that the cache flush needs to be performed after both
> SCTLR.{C,M} are cleared in lpae_pgtables_remap_asm, just before the page
> table updates. So long as the relevant pieces of kernel text are
> initially clean to the PoC, we shouldn't need to flush anything
> beforehand.

To that I say... no bloody way in hell, even once hell has frozen
over.  It took almost a _day_ to get this much working, much of it
was attempting to use cache flushing functions after the MMU had
been turned off.

If it was possible to make it work, I'd have done so.  It isn't, so
please forget the idea.

> > +ENTRY(lpae_pgtables_remap_asm)
> > +       stmfd   sp!, {r4-r8, lr}
> > +
> > +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> > +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> > +       mcr     p15, 0, ip, c1, c0, 0
> > +       dsb
> > +       isb
> 
> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
> point to an idmap/physical address)?
> 
> I don't see why we need a DSB after the write to the SCTLR.
> 
> [...]
> 
> > +       dsb
> > +       isb
> > +
> > +       mcr     p15, 0, r8, c1, c0, 0           @ re-enable MMU
> > +       dsb
> > +       isb
> 
> Similarly, isn't the last DSB redundant?

I've really no idea.  All I know is that the above works.  I'm rather
sick of trying to read the ARM ARM and not understanding exactly what
ISB and DSB actually do.
Russell King - ARM Linux April 8, 2015, 6:10 p.m. UTC | #5
On Wed, Apr 08, 2015 at 04:34:28PM +0200, Thomas Petazzoni wrote:
> But since I'm not going to write an e-mail just for such a small nit,
> here is a hopefully more interesting question: do you think it would
> be possible to use this logic for my Armada 370/XP/375/38x case, where
> I need some specific page table configuration to be able to use HW I/O
> coherency in non-SMP configurations?

This implementation only handles LPAE, and it only knows to add an
offset to each LPAE page table descriptor.  So, adapting this code
to that application isn't going to be easy, because you're talking
about setting and/or clearing some bits.

So, I think it would take a re-implementation of it to make it work.
Once the review comments have been addressed, at least this gives an
idea of how to address the coherency issue there.
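
To illustrate: the fix-up loop in this patch is just a 64-bit add of the
P:V delta to each LPAE descriptor, along the lines of:

1:	ldrd	r4, r5, [r7]		@ load 64-bit descriptor
	adds	r4, r4, r0		@ add delta, low word
	adc	r5, r5, r1		@ ... and high word
	strd	r4, r5, [r7], #8	@ store, advance to next entry
	cmp	r7, r6
	bls	1b

whereas your case would need an orr/bic of the relevant attribute bits
(say, PMD_SECT_S) on selected entries only - a different loop with
different inputs.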
Santosh Shilimkar April 13, 2015, 7:11 p.m. UTC | #6
On 4/8/2015 10:55 AM, Russell King - ARM Linux wrote:
> On Wed, Apr 08, 2015 at 06:36:56PM +0100, Mark Rutland wrote:
>> Hi Russell,
>>
>> On Wed, Apr 08, 2015 at 10:45:30AM +0100, Russell King wrote:
>>> -       /*
>>> -        * Ensure that the above updates are flushed out of the cache.
>>> -        * This is not strictly correct; on a system where the caches
>>> -        * are coherent with each other, but the MMU page table walks
>>> -        * may not be coherent, flush_cache_all() may be a no-op, and
>>> -        * this will fail.
>>> +        * We are changing not only the virtual to physical mapping, but
>>> +        * also the physical addresses used to access memory.  We need
>>> +        * to flush all levels of cache in the system with caching
>>> +        * disabled to ensure that all data is written back.  We do this
>>> +        * with caching and write buffering disabled to ensure that
>>> +        * nothing is speculatively prefetched.
>>>           */
>>> +       cr = get_cr();
>>> +       set_cr(cr & ~(CR_I | CR_C | CR_W));
>>
>> SCTLR[3] (CR_W) is RAO/SBOP in VMSAv7. I don't think we should clear it
>> here for ARMv7.
>
> I was in two minds about that - I guess as we're expecting to only run
> this on ARMv7 CPUs, we can omit clearing the CR_W, but I'd need to add
> a comment saying that it's ARMv7 only.
>
Yep. We can do without the CR_W change.

>> To the best of my knowledge, the page table walkers aren't affected by
>> SCTLR.C, and use the attributes in TTBR{0,1} or TTBCR when translation
>> is active (i.e. when SCTLR.M is set). So at this point they can make
>> cacheable accesses to the page tables (and allocate into the caches) in
>> the background...
>
> We had better clear those bits too then.
>
>> I think that the cache flush needs to be performed after both
>> SCTLR.{C,M} are cleared in lpae_pgtables_remap_asm, just before the page
>> table updates. So long as the relevant pieces of kernel text are
>> initially clean to the PoC, we shouldn't need to flush anything
>> beforehand.
>
> To that I say... no bloody way in hell, even once hell has frozen
> over.  It took almost a _day_ to get this much working, much of it
> was attempting to use cache flushing functions after the MMU had
> been turned off.
>
> If it was possible to make it work, I'd have done so.  It isn't, so
> please forget the idea.
>
I fully agree. I went through the same pain while incorporating Will's
comments on similar lines last time.

>>> +ENTRY(lpae_pgtables_remap_asm)
>>> +       stmfd   sp!, {r4-r8, lr}
>>> +
>>> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
>>> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
>>> +       mcr     p15, 0, ip, c1, c0, 0
>>> +       dsb
>>> +       isb
>>
>> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
>> point to an idmap/physical address)?
>>
>> I don't see why we need a DSB after the write to the SCTLR.
>>
dsb can be moved up after stmfd but leaving as above should be fine
as well.

>> [...]
>>
>>> +       dsb
>>> +       isb
>>> +
>>> +       mcr     p15, 0, r8, c1, c0, 0           @ re-enable MMU
>>> +       dsb
>>> +       isb
>>
>> Similarly, isn't the last DSB redundant?
>
This dsb probably can be dropped but I leave that to Russell
to decide. That one extra instruction doesn't hurt much.

Regards,
Santosh

> I've really no idea.  All I know is that the above works.  I'm rather
> sick of trying to read the ARM ARM and not understanding exactly what
> ISB and DSB actually do.
>
Mark Rutland April 15, 2015, 12:07 p.m. UTC | #7
Hi,

> >> To the best of my knowledge, the page table walkers aren't affected by
> >> SCTLR.C, and use the attributes in TTBR{0,1} or TTBCR when translation
> >> is active (i.e. when SCTLR.M is set). So at this point they can make
> >> cacheable accesses to the page tables (and allocate into the caches) in
> >> the background...
> >
> > We had better clear those bits too then.
> >
> >> I think that the cache flush needs to be performed after both
> >> SCTLR.{C,M} are cleared in lpae_pgtables_remap_asm, just before the page
> >> table updates. So long as the relevant pieces of kernel text are
> >> initially clean to the PoC, we shouldn't need to flush anything
> >> beforehand.
> >
> > To that I say... no bloody way in hell, even once hell has frozen
> > over.  It took almost a _day_ to get this much working, much of it
> > was attempting to use cache flushing functions after the MMU had
> > been turned off.
> >
> > If it was possible to make it work, I'd have done so.  It isn't, so
> > please forget the idea.
> >
> I fully agree. I went through the same pain while incorporating Will's
> comments on similar lines last time.

I'm surprised that it's so painful to get that working. I don't have a
system I could test this on, so unfortunately I can't offer much help
there.

> >>> +ENTRY(lpae_pgtables_remap_asm)
> >>> +       stmfd   sp!, {r4-r8, lr}
> >>> +
> >>> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> >>> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> >>> +       mcr     p15, 0, ip, c1, c0, 0
> >>> +       dsb
> >>> +       isb
> >>
> >> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
> >> point to an idmap/physical address)?
> >>
> >> I don't see why we need a DSB after the write to the SCTLR.
> >>
> dsb can be moved up after stmfd but leaving as above should be fine
> as well.

I don't think that it's safe to leave it where it is. Currently the
STMFD could be reordered w.r.t. the cp15 accesses, and hence the write
may occur with translation disabled (and would go to the wrong physical
address).

We need to ensure that the STMFD is executed before the MCR potentially
changes the execution context. The DSB will ensure that in addition to
ensuring completion of the write (i.e. it isn't left sat in a write
buffer).
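
Concretely, the entry sequence I'm suggesting would be (illustrative
only):

	stmfd	sp!, {r4-r8, lr}	@ store via the current (virtual) SP
	dsb				@ complete the store while
					@ translation is still enabled
	mrc	p15, 0, r8, c1, c0, 0	@ read control reg
	bic	ip, r8, #CR_M		@ disable caches and MMU
	mcr	p15, 0, ip, c1, c0, 0
	isb				@ synchronize the SCTLR write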

> >> [...]
> >>
> >>> +       dsb
> >>> +       isb
> >>> +
> >>> +       mcr     p15, 0, r8, c1, c0, 0           @ re-enable MMU
> >>> +       dsb
> >>> +       isb
> >>
> >> Similarly, isn't the last DSB redundant?
> >
> This dsb probably can be dropped but I leave that to Russell
> to decide. That one extra instruction doesn't hurt much.

I don't see that it adds anything on top of the ISB, given the DSB; ISB
sequence prior to the write to the SCTLR -- there's nothing for it to
ensure completion of that the first DSB won't already have.
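
In other words, the exit path could presumably be trimmed to (untested):

	dsb				@ completes the TLB/BTB maintenance
	isb
	mcr	p15, 0, r8, c1, c0, 0	@ re-enable MMU
	isb				@ synchronize the SCTLR write
	ldmfd	sp!, {r4-r8, pc}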

Mark.
Santosh Shilimkar April 15, 2015, 5:27 p.m. UTC | #8
On 4/15/2015 5:07 AM, Mark Rutland wrote:
> Hi,
>
[..]

>>>>> +ENTRY(lpae_pgtables_remap_asm)
>>>>> +       stmfd   sp!, {r4-r8, lr}
>>>>> +
>>>>> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
>>>>> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
>>>>> +       mcr     p15, 0, ip, c1, c0, 0
>>>>> +       dsb
>>>>> +       isb
>>>>
>>>> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
>>>> point to an idmap/physical address)?
>>>>
>>>> I don't see why we need a DSB after the write to the SCTLR.
>>>>
>> dsb can be moved up after stmfd but leaving as above should be fine
>> as well.
>
> I don't think that it's safe to leave it where it is. Currently the
> STMFD could be reordered w.r.t. the cp15 accesses, and hence the write
> may occur with translation disabled (and would go to the wrong physical
> address).
>
> We need to ensure that the STMFD is executed before the MCR potentially
> changes the execution context. The DSB will ensure that in addition to
> ensuring completion of the write (i.e. it isn't left sat in a write
> buffer).
>
I see your point. Thanks for expanding it.

Regards,
Santosh
Mark Rutland April 23, 2015, 11:24 a.m. UTC | #9
Hi,

> >>>>> +ENTRY(lpae_pgtables_remap_asm)
> >>>>> +       stmfd   sp!, {r4-r8, lr}
> >>>>> +
> >>>>> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> >>>>> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> >>>>> +       mcr     p15, 0, ip, c1, c0, 0
> >>>>> +       dsb
> >>>>> +       isb
> >>>>
> >>>> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
> >>>> point to an idmap/physical address)?
> >>>>
> >>>> I don't see why we need a DSB after the write to the SCTLR.
> >>>>
> >> dsb can be moved up after stmfd but leaving as above should be fine
> >> as well.
> >
> > I don't think that it's safe to leave it where it is. Currently the
> > STMFD could be reordered w.r.t. the cp15 accesses, and hence the write
> > may occur with translation disabled (and would go to the wrong physical
> > address).
> >
> > We need to ensure that the STMFD is executed before the MCR potentially
> > changes the execution context. The DSB will ensure that in addition to
> > ensuring completion of the write (i.e. it isn't left sat in a write
> > buffer).
> >
> I see your point. Thanks for expanding it.

It turns out that I was incorrect in my assertion, and the reordering I
suggested above can't happen. The ARMv7 ARM states:

	Any direct write to a system control register is guaranteed not
	to affect any instruction that appears, in program
	order, before the instruction that performed the direct write

Which means that the STMFD cannot be affected by the later cp15 write to
the SCTLR, and so the DSB does not need to be moved before the MCR.

I apologise for adding to the confusion there.

Mark.
Russell King - ARM Linux May 6, 2015, 10:18 a.m. UTC | #10
On Thu, Apr 23, 2015 at 12:24:58PM +0100, Mark Rutland wrote:
> Hi,
> 
> > >>>>> +ENTRY(lpae_pgtables_remap_asm)
> > >>>>> +       stmfd   sp!, {r4-r8, lr}
> > >>>>> +
> > >>>>> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> > >>>>> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> > >>>>> +       mcr     p15, 0, ip, c1, c0, 0
> > >>>>> +       dsb
> > >>>>> +       isb
> > >>>>
> > >>>> Shouldn't the DSB be between the STMFD and the MCR (given the SP doesn't
> > >>>> point to an idmap/physical address)?
> > >>>>
> > >>>> I don't see why we need a DSB after the write to the SCTLR.
> > >>>>
> > >> dsb can be moved up after stmfd but leaving as above should be fine
> > >> as well.
> > >
> > > I don't think that it's safe to leave it where it is. Currently the
> > > STMFD could be reordered w.r.t. the cp15 accesses, and hence the write
> > > may occur with translation disabled (and would go to the wrong physical
> > > address).
> > >
> > > We need to ensure that the STMFD is executed before the MCR potentially
> > > changes the execution context. The DSB will ensure that in addition to
> > > ensuring completion of the write (i.e. it isn't left sat in a write
> > > buffer).
> > >
> > I see your point. Thanks for expanding it.
> 
> It turns out that I was incorrect in my assertion, and the reordering I
> suggested above can't happen. The ARMv7 ARM states:
> 
> 	Any direct write to a system control register is guaranteed not
> 	to affect any instruction that appears, in program
> 	order, before the instruction that performed the direct write
> 
> Which means that the STMFD cannot be affected by the later cp15 write to
> the SCTLR, and so the DSB does not need to be moved before the MCR.
> 
> I apologise for adding to the confusion there.

So does this mean this patch gets an ack now?
Mark Rutland May 6, 2015, 10:37 a.m. UTC | #11
Hi Russell,

> > It turns out that I was incorrect in my assertion, and the reordering I
> > suggested above can't happen. The ARMv7 ARM states:
> > 
> > 	Any direct write to a system control register is guaranteed not
> > 	to affect any instruction that appears, in program
> > 	order, before the instruction that performed the direct write
> > 
> > Which means that the STMFD cannot be affected by the later cp15 write to
> > the SCTLR, and so the DSB does not need to be moved before the MCR.
> > 
> > I apologise for adding to the confusion there.
> 
> So does this mean this patch gets an ack now?

I assumed there was going to be a respin for the CR_W change?

There's also the dodginess w.r.t. the page table walkers that I can't
see is solvable short of disabling the MMU prior to the flush, though I
understand you've NAKed that approach.

Thanks,
Mark.
Russell King - ARM Linux May 6, 2015, 11:33 a.m. UTC | #12
On Wed, May 06, 2015 at 11:37:37AM +0100, Mark Rutland wrote:
> Hi Russell,
> 
> > > It turns out that I was incorrect in my assertion, and the reordering I
> > > suggested above can't happen. The ARMv7 ARM states:
> > > 
> > > 	Any direct write to a system control register is guaranteed not
> > > 	to affect any instruction that appears, in program
> > > 	order, before the instruction that performed the direct write
> > > 
> > > Which means that the STMFD cannot be affected by the later cp15 write to
> > > the SCTLR, and so the DSB does not need to be moved before the MCR.
> > > 
> > > I apologise for adding to the confusion there.
> > 
> > So does this mean this patch gets an ack now?
> 
> I assumed there was going to be a respin for the CR_W change?
> 
> There's also the dodginess w.r.t. the page table walkers that I can't
> see is solvable short of disabling the MMU prior to the flush, though I
> understand you've NAKed that approach.

Why?  Are you saying that after:

+       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
+       bic     ip, r8, #CR_M                   @ disable caches and MMU
+       mcr     p15, 0, ip, c1, c0, 0
+       dsb
+       isb

the page table walkers are still actively walking the page table?

That to me sounds like a hardware bug.  The point of this is to shut
down the MMU, _then_ update the page tables, and _then_ to re-enable
the MMU to explicitly avoid problems with the page table walkers.

From what you're saying, this is just not possible, and we should
just throw Keystone SoCs in the bin...
Mark Rutland May 6, 2015, 3:33 p.m. UTC | #13
On Wed, May 06, 2015 at 12:33:13PM +0100, Russell King - ARM Linux wrote:
> On Wed, May 06, 2015 at 11:37:37AM +0100, Mark Rutland wrote:
> > Hi Russell,
> > 
> > > > It turns out that I was incorrect in my assertion, and the reordering I
> > > > suggested above can't happen. The ARMv7 ARM states:
> > > > 
> > > > 	Any direct write to a system control register is guaranteed not
> > > > 	to affect any instruction that appears, in program
> > > > 	order, before the instruction that performed the direct write
> > > > 
> > > > Which means that the STMFD cannot be affected by the later cp15 write to
> > > > the SCTLR, and so the DSB does not need to be moved before the MCR.
> > > > 
> > > > I apologise for adding to the confusion there.
> > > 
> > > So does this mean this patch gets an ack now?
> > 
> > I assumed there was going to be a respin for the CR_W change?
> > 
> > There's also the dodginess w.r.t. the page table walkers that I can't
> > see is solvable short of disabling the MMU prior to the flush, though I
> > understand you've NAKed that approach.
> 
> Why? 

I was on about the pre-assembly portion:

	cr = get_cr();
	set_cr(cr & ~(CR_I | CR_C | CR_W));
	flush_cache_all();

With the MMU on at this point the page table walkers can race with the
set/way maintenance. It also relies on the compiler not making stack
accesses between the SCTLR write and the completion of flush_cache_all,
which is likely but not guaranteed.

So this won't necessarily flush out the data it seems to be intended to.
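
To spell out the window, this is roughly (and only roughly - it depends
on what the compiler emits) what the sequence above becomes on ARMv7:

	mrc	p15, 0, r0, c1, c0, 0	@ get_cr()
	bic	r0, r0, #CR_C		@ two BICs: CR_I | CR_C is not
	bic	r0, r0, #CR_I		@ a single valid immediate
	mcr	p15, 0, r0, c1, c0, 0	@ set_cr(): C/I clear, M still set
	@ <-- until the set/way loop below completes, the walkers may
	@     still allocate lines, and the compiler may spill to the
	@     stack
	bl	v7_flush_kern_cache_all	@ what flush_cache_all() resolves
					@ to on ARMv7

(CR_W is left out here, as it's RAO/SBOP on v7 as discussed earlier.)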

> Are you saying that after:
> 
> +       mrc     p15, 0, r8, c1, c0, 0           @ read control reg
> +       bic     ip, r8, #CR_M                   @ disable caches and MMU
> +       mcr     p15, 0, ip, c1, c0, 0
> +       dsb
> +       isb
> 
> the page table walkers are still actively walking the page table?
> 
> That to me sounds like a hardware bug.  The point of this is to shut
> down the MMU, _then_ update the page tables, and _then_ to re-enable
> the MMU to explicitly avoid problems with the page table walkers.

I agree that after this point it would be a bug for the page table
walkers to make cacheable accesses.

Thanks,
Mark.
Russell King - ARM Linux May 6, 2015, 3:50 p.m. UTC | #14
On Wed, May 06, 2015 at 04:33:38PM +0100, Mark Rutland wrote:
> On Wed, May 06, 2015 at 12:33:13PM +0100, Russell King - ARM Linux wrote:
> > On Wed, May 06, 2015 at 11:37:37AM +0100, Mark Rutland wrote:
> > > Hi Russell,
> > > 
> > > > > It turns out that I was incorrect in my assertion, and the reordering I
> > > > > suggested above can't happen. The ARMv7 ARM states:
> > > > > 
> > > > > 	Any direct write to a system control register is guaranteed not
> > > > > 	to affect any instruction that appears, in program
> > > > > 	order, before the instruction that performed the direct write
> > > > > 
> > > > > Which means that the STMFD cannot be affected by the later cp15 write to
> > > > > the SCTLR, and so the DSB does not need to be moved before the MCR.
> > > > > 
> > > > > I apologise for adding to the confusion there.
> > > > 
> > > > So does this mean this patch gets an ack now?
> > > 
> > > I assumed there was going to be a respin for the CR_W change?
> > > 
> > > There's also the dodginess w.r.t. the page table walkers that I can't
> > > see is solvable short of disabling the MMU prior to the flush, though I
> > > understand you've NAKed that approach.
> > 
> > Why? 
> 
> I was on about the pre-assembly portion:
> 
> 	cr = get_cr();
> 	set_cr(cr & ~(CR_I | CR_C | CR_W));
> 	flush_cache_all();
> 
> With the MMU on at this point the page table walkers can race with the
> set/way maintenance. It also relies on the compiler not making stack
> accesses between the SCTLR write and the completion of flush_cache_all,
> which is likely but not guaranteed.

Yes, I suppose you're right, but there's really no way to do this
other than coding up CPU-specific code.

What we'd need to do is to code up a specific assembly routine which
disables the I and C control register flags, flushes the cache, jumps
to the physical alias and then disables the MMU all without touching
memory.  That's far too much effort.  Like I say, maybe we should just
bin Keystone for this crap...

I'm not interested in trying to "fix" this code any further - as I
said earlier, it took quite a while to get this code working on
Keystone, which is really all we care about.  Anyone else who copies
the abortion that TI made in Keystone should be shot. :)

While the community has the luxury of saying "we don't like it, it's
TI's problem" - which is what was done when the problems were first
pointed out by Will, the net result is that this problem became my
problem to try and sort out.

So, if you have a better idea how to fix this, I'm all ears.

What I'm certain of though is that this is an improvement over what's
in the kernel today, and so should go in irrespective of whether it's
totally bullet-proof or not.
Mark Rutland May 6, 2015, 4:14 p.m. UTC | #15
Hi Russell,

> > 	cr = get_cr();
> > 	set_cr(cr & ~(CR_I | CR_C | CR_W));
> > 	flush_cache_all();
> > 
> > With the MMU on at this point the page table walkers can race with the
> > set/way maintenance. It also relies on the compiler not making stack
> > accesses between the SCTLR write and the completion of flush_cache_all,
> > which is likely but not guaranteed.
> 
> Yes, I suppose you're right, but there's really no way to do this
> other than coding up CPU-specific code.
> 
> What we'd need to do is to code up a specific assembly routine which
> disables the I and C control register flags, flushes the cache, jumps
> to the physical alias and then disables the MMU all without touching
> memory.  That's far too much effort.  Like I say, maybe we should just
> bin Keystone for this crap...
> 
> I'm not interested in trying to "fix" this code any further - as I
> said earlier, it took quite a while to get this code working on
> Keystone, which is really all we care about.  Anyone else who copies
> the abortion that TI made in Keystone should be shot. :)
> 
> While the community has the luxury of saying "we don't like it, it's
> TI's problem" - which is what was done when the problems were first
> pointed out by Will, the net result is that this problem became my
> problem to try and sort out.
> 
> So, if you have a better idea how to fix this, I'm all ears.
>
> What I'm certain of though is that this is an improvement over what's
> in the kernel today, and so should go in irrespective of whether it's
> totally bullet-proof or not.

I don't disagree with that. :)

w.r.t. "better" ideas, my initial thought was that we could move the
SCTLR.{C,M} clearing into lpae_pgtables_remap_asm, along with a call to
v7_flush_dcache_all (as it should be in the physical mapping). So long
as the kernel text was initially clean to the PoC I'd expect that to
work, but I understood from your initial reply you'd tried that and
something went wrong.

Perhaps we can look into that later.

Thanks,
Mark.
Will Deacon May 6, 2015, 4:24 p.m. UTC | #16
On Wed, May 06, 2015 at 05:14:06PM +0100, Mark Rutland wrote:
> > > 	cr = get_cr();
> > > 	set_cr(cr & ~(CR_I | CR_C | CR_W));
> > > 	flush_cache_all();
> > > 
> > > With the MMU on at this point the page table walkers can race with the
> > > set/way maintenance. It also relies on the compiler not making stack
> > > accesses between the SCTLR write and the completion of flush_cache_all,
> > > which is likely but not guaranteed.
> > 
> > Yes, I suppose you're right, but there's really no way to do this
> > other than coding up CPU-specific code.
> > 
> > What we'd need to do is to code up a specific assembly routine which
> > disables the I and C control register flags, flushes the cache, jumps
> > to the physical alias and then disables the MMU all without touching
> > memory.  That's far too much effort.  Like I say, maybe we should just
> > bin Keystone for this crap...
> > 
> > I'm not interested in trying to "fix" this code any further - as I
> > said earlier, it took quite a while to get this code working on
> > Keystone, which is really all we care about.  Anyone else who copies
> > the abortion that TI made in Keystone should be shot. :)
> > 
> > While the community has the luxury of saying "we don't like it, it's
> > TI's problem" - which is what was done when the problems were first
> > pointed out by Will, the net result is that this problem became my
> > problem to try and sort out.
> > 
> > So, if you have a better idea how to fix this, I'm all ears.
> >
> > What I'm certain of though is that this is an improvement over what's
> > in the kernel today, and so should go in irrespective of whether it's
> > totally bullet-proof or not.
> 
> I don't disagree with that. :)
> 
> w.r.t. "better" ideas, my initial thought was that we could move the
> SCTLR.{C,M} clearing into lpae_pgtables_remap_asm, along with a call to
> v7_flush_dcache_all (as it should be in the physical mapping). So long
> as the kernel text was initially clean to the PoC I'd expect that to
> work, but I understood from your initial reply you'd tried that and
> something went wrong.
> 
> Perhaps we can look into that later.

I think a comment summarising the conclusion of this thread above the
cache flush and rmk's choice of adjective to describe the keystone SoC
would be sufficient.

Will

Patch

diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index b7644310236b..cc873e74643a 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -624,6 +624,10 @@  config ARM_LPAE
 
 	  If unsure, say N.
 
+config ARM_PV_FIXUP
+	def_bool y
+	depends on ARM_LPAE && ARM_PATCH_PHYS_VIRT && ARCH_KEYSTONE
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool ARM_LPAE
 
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 19df9bd9a79f..290ec5056b3b 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -18,6 +18,7 @@  obj-$(CONFIG_MODULES)		+= proc-syms.o
 obj-$(CONFIG_ALIGNMENT_TRAP)	+= alignment.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
+obj-$(CONFIG_ARM_PV_FIXUP)	+= pv-fixup-asm.o
 
 obj-$(CONFIG_CPU_ABRT_NOMMU)	+= abort-nommu.o
 obj-$(CONFIG_CPU_ABRT_EV4)	+= abort-ev4.o
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 4eab87eab5aa..addb7987c714 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1387,7 +1387,11 @@  static void __init map_lowmem(void)
 	}
 }
 
-#if defined(CONFIG_ARM_LPAE) && defined(CONFIG_ARM_PATCH_PHYS_VIRT)
+#ifdef CONFIG_ARM_PV_FIXUP
+extern unsigned long __atags_pointer;
+typedef void pgtables_remap(long long offset, unsigned long pgd, void *bdata);
+pgtables_remap lpae_pgtables_remap_asm;
+
 /*
  * early_paging_init() recreates boot time page table setup, allowing machines
  * to switch over to a high (>4G) address space on LPAE systems
@@ -1395,35 +1399,30 @@  static void __init map_lowmem(void)
 void __init early_paging_init(const struct machine_desc *mdesc,
 			      struct proc_info_list *procinfo)
 {
-	pmdval_t pmdprot = procinfo->__cpu_mm_mmu_flags;
-	unsigned long map_start, map_end;
+	pgtables_remap *lpae_pgtables_remap;
+	unsigned long pa_pgd;
+	unsigned int cr;
 	long long offset;
-	pgd_t *pgd0, *pgdk;
-	pud_t *pud0, *pudk, *pud_start;
-	pmd_t *pmd0, *pmdk;
-	phys_addr_t phys;
-	int i;
+	void *boot_data;
 
 	if (!mdesc->pv_fixup)
 		return;
 
-	/* remap kernel code and data */
-	map_start = init_mm.start_code & PMD_MASK;
-	map_end   = ALIGN(init_mm.brk, PMD_SIZE);
-
-	/* get a handle on things... */
-	pgd0 = pgd_offset_k(0);
-	pud_start = pud0 = pud_offset(pgd0, 0);
-	pmd0 = pmd_offset(pud0, 0);
-
-	pgdk = pgd_offset_k(map_start);
-	pudk = pud_offset(pgdk, map_start);
-	pmdk = pmd_offset(pudk, map_start);
-
 	offset = mdesc->pv_fixup();
 	if (offset == 0)
 		return;
 
+	/*
+	 * Get the address of the remap function in the 1:1 identity
+	 * mapping setup by the early page table assembly code.  We
+	 * must get this prior to the pv update.  The following barrier
+	 * ensures that this is complete before we fixup any P:V offsets.
+	 */
+	lpae_pgtables_remap = (pgtables_remap *)(unsigned long)__pa(lpae_pgtables_remap_asm);
+	pa_pgd = __pa(swapper_pg_dir);
+	boot_data = __va(__atags_pointer);
+	barrier();
+
 	pr_info("Switching physical address space to 0x%08llx\n",
 		(u64)PHYS_OFFSET + offset);
 
@@ -1436,75 +1435,27 @@  void __init early_paging_init(const struct machine_desc *mdesc,
 		(&__pv_table_end - &__pv_table_begin) << 2);
 
 	/*
-	 * Cache cleaning operations for self-modifying code
-	 * We should clean the entries by MVA but running a
-	 * for loop over every pv_table entry pointer would
-	 * just complicate the code.
-	 */
-	flush_cache_louis();
-	dsb(ishst);
-	isb();
-
-	/*
-	 * FIXME: This code is not architecturally compliant: we modify
-	 * the mappings in-place, indeed while they are in use by this
-	 * very same code.  This may lead to unpredictable behaviour of
-	 * the CPU.
-	 *
-	 * Even modifying the mappings in a separate page table does
-	 * not resolve this.
-	 *
-	 * The architecture strongly recommends that when a mapping is
-	 * changed, that it is changed by first going via an invalid
-	 * mapping and back to the new mapping.  This is to ensure that
-	 * no TLB conflicts (caused by the TLB having more than one TLB
-	 * entry match a translation) can occur.  However, doing that
-	 * here will result in unmapping the code we are running.
-	 */
-	pr_warn("WARNING: unsafe modification of in-place page tables - tainting kernel\n");
-	add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
-
-	/*
-	 * Remap level 1 table.  This changes the physical addresses
-	 * used to refer to the level 2 page tables to the high
-	 * physical address alias, leaving everything else the same.
-	 */
-	for (i = 0; i < PTRS_PER_PGD; pud0++, i++) {
-		set_pud(pud0,
-			__pud(__pa(pmd0) | PMD_TYPE_TABLE | L_PGD_SWAPPER));
-		pmd0 += PTRS_PER_PMD;
-	}
-
-	/*
-	 * Remap the level 2 table, pointing the mappings at the high
-	 * physical address alias of these pages.
-	 */
-	phys = __pa(map_start);
-	do {
-		*pmdk++ = __pmd(phys | pmdprot);
-		phys += PMD_SIZE;
-	} while (phys < map_end);
-
-	/*
-	 * Ensure that the above updates are flushed out of the cache.
-	 * This is not strictly correct; on a system where the caches
-	 * are coherent with each other, but the MMU page table walks
-	 * may not be coherent, flush_cache_all() may be a no-op, and
-	 * this will fail.
+	 * We are changing not only the virtual to physical mapping, but
+	 * also the physical addresses used to access memory.  We need
+	 * to flush all levels of cache in the system with caching
+	 * disabled to ensure that all data is written back.  We do this
+	 * with caching and write buffering disabled to ensure that
+	 * nothing is speculatively prefetched.
 	 */
+	cr = get_cr();
+	set_cr(cr & ~(CR_I | CR_C | CR_W));
 	flush_cache_all();
 
 	/*
-	 * Re-write the TTBR values to point them at the high physical
-	 * alias of the page tables.  We expect __va() will work on
-	 * cpu_get_pgd(), which returns the value of TTBR0.
+	 * Fixup the page tables - this must be in the idmap region as
+	 * we need to disable the MMU to do this safely, and hence it
+	 * needs to be assembly.  It's fairly simple, as we're using the
+	 * temporary tables setup by the initial assembly code.
 	 */
-	cpu_switch_mm(pgd0, &init_mm);
-	cpu_set_ttbr(1, __pa(pgd0) + TTBR1_OFFSET);
+	lpae_pgtables_remap(offset, pa_pgd, boot_data);
 
-	/* Finally flush any stale TLB values. */
-	local_flush_bp_all();
-	local_flush_tlb_all();
+	/* Re-enable the caches */
+	set_cr(cr);
 }
 
 #else
diff --git a/arch/arm/mm/pv-fixup-asm.S b/arch/arm/mm/pv-fixup-asm.S
new file mode 100644
index 000000000000..1867f3e43016
--- /dev/null
+++ b/arch/arm/mm/pv-fixup-asm.S
@@ -0,0 +1,88 @@ 
+/*
+ *  Copyright (C) 2015 Russell King
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This assembly is required to safely remap the physical address space
+ * for Keystone 2
+ */
+#include <linux/linkage.h>
+#include <asm/asm-offsets.h>
+#include <asm/cp15.h>
+#include <asm/memory.h>
+#include <asm/pgtable.h>
+
+	.section ".idmap.text", "ax"
+
+#define L1_ORDER 3
+#define L2_ORDER 3
+
+ENTRY(lpae_pgtables_remap_asm)
+	stmfd	sp!, {r4-r8, lr}
+
+	mrc	p15, 0, r8, c1, c0, 0		@ read control reg
+	bic	ip, r8, #CR_M			@ disable caches and MMU
+	mcr	p15, 0, ip, c1, c0, 0
+	dsb
+	isb
+
+	/* Update level 2 entries covering the kernel */
+	ldr	r6, =(_end - 1)
+	add	r7, r2, #0x1000
+	add	r6, r7, r6, lsr #SECTION_SHIFT - L2_ORDER
+	add	r7, r7, #PAGE_OFFSET >> (SECTION_SHIFT - L2_ORDER)
+1:	ldrd	r4, [r7]
+	adds	r4, r4, r0
+	adc	r5, r5, r1
+	strd	r4, [r7], #1 << L2_ORDER
+	cmp	r7, r6
+	bls	1b
+
+	/* Update level 2 entries for the boot data */
+	add	r7, r2, #0x1000
+	add	r7, r7, r3, lsr #SECTION_SHIFT - L2_ORDER
+	bic	r7, r7, #(1 << L2_ORDER) - 1
+	ldrd	r4, [r7]
+	adds	r4, r4, r0
+	adc	r5, r5, r1
+	strd	r4, [r7], #1 << L2_ORDER
+	ldrd	r4, [r7]
+	adds	r4, r4, r0
+	adc	r5, r5, r1
+	strd	r4, [r7]
+
+	/* Update level 1 entries */
+	mov	r6, #4
+	mov	r7, r2
+2:	ldrd	r4, [r7]
+	adds	r4, r4, r0
+	adc	r5, r5, r1
+	strd	r4, [r7], #1 << L1_ORDER
+	subs	r6, r6, #1
+	bne	2b
+
+	mrrc	p15, 0, r4, r5, c2		@ read TTBR0
+	adds	r4, r4, r0			@ update physical address
+	adc	r5, r5, r1
+	mcrr	p15, 0, r4, r5, c2		@ write back TTBR0
+	mrrc	p15, 1, r4, r5, c2		@ read TTBR1
+	adds	r4, r4, r0			@ update physical address
+	adc	r5, r5, r1
+	mcrr	p15, 1, r4, r5, c2		@ write back TTBR1
+
+	dsb
+
+	mov	ip, #0
+	mcr	p15, 0, ip, c7, c5, 0		@ I+BTB cache invalidate
+	mcr	p15, 0, ip, c8, c7, 0		@ local_flush_tlb_all()
+	dsb
+	isb
+
+	mcr	p15, 0, r8, c1, c0, 0		@ re-enable MMU
+	dsb
+	isb
+
+	ldmfd	sp!, {r4-r8, pc}
+ENDPROC(lpae_pgtables_remap_asm)