sh: dcache flush breaks text region?

Message ID 4968DD28.3030709@juno.dti.ne.jp
State Rejected
Delegated to: Paul Mundt

Commit Message

Shin-ichiro KAWASAKI Jan. 10, 2009, 5:38 p.m. UTC
Hi, all.

I'm now working on expanding qemu-sh to emulate the
"Solution Engine 7750", and I found one odd thing.
Could you give me some advice?

My SH7750 emulation environment fails to boot.
I investigated and found the following:
 - The Linux kernel for SE7750 (se7750_defconfig) flushes
   the dcache during its boot sequence.
 - The SH7750's dcache is 16KB and direct-mapped, so a
   16KB memory region is touched and modified to flush it.
 - empty_zero_page is used for this flush, but it is only
   4KB.  The text region after it gets corrupted, which
   causes the boot failure.
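
The overrun described in the list above can be illustrated with a small
sketch. This is not kernel or QEMU code: the toy_image layout, the
naive_flush() helper and the 32-byte line size are assumptions made only
for the illustration. It models an emulator without a cache model, where
every store issued by the flush loop lands in memory.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE   4096u          /* size of empty_zero_page */
#define DCACHE_SIZE (16u * 1024u)  /* SH7750 dcache size, per the report */
#define LINE_SIZE   32u            /* assumed cache line size */

/* Toy image: a 4KB zero page immediately followed by "text". */
struct toy_image {
    unsigned char zero_page[PAGE_SIZE];
    unsigned char text[DCACHE_SIZE - PAGE_SIZE];
};

/*
 * Emulator without a cache model: every flush store reaches memory.
 * Returns how many non-zero words beyond the zero page were overwritten.
 */
static size_t naive_flush(struct toy_image *img, size_t extent)
{
    unsigned char *base = (unsigned char *)img;
    size_t clobbered = 0;

    for (size_t off = 0; off < extent; off += LINE_SIZE) {
        if (off >= PAGE_SIZE && base[off] != 0)
            clobbered++;
        memset(base + off, 0, 4);  /* stands in for a 4-byte movca.l store */
    }
    return clobbered;
}
```

With extent set to PAGE_SIZE the loop stays inside the zero page; with
extent set to DCACHE_SIZE it also overwrites one word in every 32-byte
line of the 12KB that follows.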

I have attached a patch against the Linux kernel to this mail for reference.
It only reduces the flush region size to 4KB (PAGE_SIZE), but it avoids
the problem and lets the kernel boot cleanly.
Of course it is not a good solution, because it does not flush the
whole cache.

I have two questions.
 - Does this problem happen on a real SE7750 board?
   In other words, does the current Linux kernel work on it?
   I don't have one, so I can't check it myself.
 - How should I solve the problem?
   Should the kernel allocate a 16KB region for the flush?

The patches for SE7750 emulation have not yet been posted
to qemu-devel. I'd like to solve this problem before posting them.

Any comments will be appreciated.

Regards,
Shin-ichiro KAWASAKI



--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Edgar E. Iglesias Jan. 10, 2009, 7:53 p.m. UTC | #1
On Sun, Jan 11, 2009 at 02:38:48AM +0900, Shin-ichiro KAWASAKI wrote:
> Hi, all.
> 
> I'm now working on to expand qemu-sh to emulate
> "Solution Engine 7750", and found one odd thing.
> Could you give me some advice?
> 
> My SH7750 emulation environment fails to boot up.
> I made some investigation and found that,
>  - the linux kernel for SE7750(se7750_defconfig) flushes
>    dcache on its boot sequence.
>  - SH7750's dcache is 16KB and direct-map.
>    Then 16KB memory region are touched and modified to flush it.
>  - empty_zero_page is used for this flush, but it only has
>    4KB.  The text region after it has got broken and causes
>    boot failure.
> 
> I added a patch against linux kernel to this mail for a reference.
> It only reduces the flush region size to 4KB=PAGE_SIZE, but avoids
> the problem and let the kernel boot up cleanly.
> Of course it is not a good solution, because it does not flush all
> caches.
> 
> I wonder two points.
>  - Does this problem happen on real SE7750 board?

Hello,

I'm not very familiar with sh arch so please take this with a grain
of salt :)

It's not entirely clear to me if the bug will show up on silicon, but my
guess is that it won't.

From my understanding of the docs, a movca store that misses in the
cache is processed with a write-validate write-miss policy. That means that
the movca store will allocate the line (flushing any previous content if
needed) but not fetch any data corresponding to the movca store address.
The sh7750 does not have multiple dirty bits per line, so that kind of
treatment leaves the unwritten parts of the line with unpredictable contents.

Such insns can be very useful for fast block copies through writeback caches
that otherwise do a line fetch for write-misses.

So, when the ocbi insn invalidates the line, no write-back is done and the
downstream buses never see the movca store.
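
A toy model may make the movca + ocbi interaction concrete. This is a
minimal sketch, not a description of the real SH7750 cache: the
single-line toy_cache, the byte-sized store and the names toy_movca()
and toy_ocbi() are all invented for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u
#define MEM_SIZE  256u

/* One-line toy write-back cache in front of a small memory. */
struct toy_cache {
    uint8_t  mem[MEM_SIZE];
    uint8_t  line[LINE_SIZE];
    uint32_t tag;              /* line-aligned address held in the line */
    bool     valid, dirty;
};

/* Write the line back to memory if it holds dirty data. */
static void write_back(struct toy_cache *c)
{
    if (c->valid && c->dirty)
        memcpy(&c->mem[c->tag], c->line, LINE_SIZE);
    c->dirty = false;
}

/* movca-like store: allocate the line without fetching it from memory. */
static void toy_movca(struct toy_cache *c, uint32_t addr, uint8_t val)
{
    uint32_t tag = addr & ~(LINE_SIZE - 1);

    if (!c->valid || c->tag != tag) {
        write_back(c);         /* evict the previous contents if dirty */
        c->tag = tag;
        c->valid = true;       /* no line fetch: the other bytes are stale */
    }
    c->line[addr - tag] = val;
    c->dirty = true;
}

/* ocbi-like invalidate: drop the line with no write-back. */
static void toy_ocbi(struct toy_cache *c, uint32_t addr)
{
    if (c->valid && c->tag == (addr & ~(LINE_SIZE - 1)))
        c->valid = c->dirty = false;
}
```

In this model, toy_movca() followed by toy_ocbi() on the same address
leaves mem[] untouched, whereas a toy_movca() whose line is later
evicted by another allocation does reach memory.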

I'm not sure how to handle this in qemu without adding cache models.
One way to handle this particular cacheflush sequence might be to delay all
movca stores until another load/store or cache control insn is issued,
to help you figure out whether you can ignore the previous movca. That will
not by any means cover all cases though.

Another solution might be for linux to use an ocbp followed by an ocbi insn
on the line. IIUC that should achieve the same net result.

Cheers
Shin-ichiro KAWASAKI Jan. 11, 2009, 3:58 a.m. UTC | #2
Edgar E. Iglesias wrote:
> On Sun, Jan 11, 2009 at 02:38:48AM +0900, Shin-ichiro KAWASAKI wrote:
>> Hi, all.
>>
>> I'm now working on to expand qemu-sh to emulate
>> "Solution Engine 7750", and found one odd thing.
>> Could you give me some advice?
>>
>> My SH7750 emulation environment fails to boot up.
>> I made some investigation and found that,
>>  - the linux kernel for SE7750(se7750_defconfig) flushes
>>    dcache on its boot sequence.
>>  - SH7750's dcache is 16KB and direct-map.
>>    Then 16KB memory region are touched and modified to flush it.
>>  - empty_zero_page is used for this flush, but it only has
>>    4KB.  The text region after it has got broken and causes
>>    boot failure.
>>
>> I added a patch against linux kernel to this mail for a reference.
>> It only reduces the flush region size to 4KB=PAGE_SIZE, but avoids
>> the problem and let the kernel boot up cleanly.
>> Of course it is not a good solution, because it does not flush all
>> caches.
>>
>> I wonder two points.
>>  - Does this problem happen on real SE7750 board?
> 
> Hello,
> 
> I'm not very familiar with sh arch so please take this with a grain
> of salt :)
> 
> It's not entirely clear to me if the bug will show up on silicon, but my
> guess is that it wont.
> 
> From my understanding of the docs, the movca store will for misses in the
> cache be processed with a write-validate write-miss policy. That means that
> the movca store will allocate the line (flushing any previous content if
> needed) but not fetch any data corresponding to the movca store address.
> The sh7750 does not have multiple dirty bits per line so that kind of
> treatment leaves the unwritten parts of the line with unpredictable results.
> 
> Such insns can be very useful for fast block copies through writeback caches
> that otherwise do a line fetch for write-misses.
> 
> So, when the ocbi insn invalidates the line, no write back is done and the
> downstream busses never see the movca store.

Thanks a lot!  This explains the situation.
I hadn't understood what movca does.

> I'm not sure how to handle this in qemu without adding cache models.
That seems like too much work and might have a performance drawback.

> One way to handle this particular cacheflush sequence might be to delay all
> movca stores until there's another load/store or cache control insn being
> issued to help you figure out if you can ignore previous movca. That will
> not by any means cover all cases though.
That seems like a good way to avoid this problem.
My current modification plan is as follows.
 - On executing 'movca', just record in CPUStatus the store that
   movca should perform.
 - On executing 'ocbi', delete the recorded store.
 - Let TCG emit a 'delayed_movca' instruction before
   the first 'memory touching insn' or 'exception producing insn'
   after movca.
 - On executing 'delayed_movca', perform the recorded store.
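
The plan above could be sketched roughly as follows. This is plain C,
not QEMU/TCG code: struct toy_cpu, commit_pending() and the exec_*()
helpers are hypothetical names, only one deferred store is tracked, and
the same-line check in exec_ocbi() is an added assumption (the plan as
written simply deletes the store).

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_MASK (~(uint32_t)31)   /* assumed 32-byte cache lines */

/* Hypothetical CPU state holding one deferred movca store. */
struct toy_cpu {
    uint32_t mem[64];               /* small word-addressed toy memory */
    bool     pending;
    uint32_t pend_addr, pend_val;
};

/* Perform the deferred store, if any (the 'delayed_movca' step). */
static void commit_pending(struct toy_cpu *cpu)
{
    if (cpu->pending) {
        cpu->mem[cpu->pend_addr / 4] = cpu->pend_val;
        cpu->pending = false;
    }
}

/* movca: record the store instead of performing it. */
static void exec_movca(struct toy_cpu *cpu, uint32_t addr, uint32_t val)
{
    commit_pending(cpu);            /* an earlier movca can no longer be cancelled */
    cpu->pending = true;
    cpu->pend_addr = addr;
    cpu->pend_val = val;
}

/* ocbi: cancel the recorded store when it targets the same line. */
static void exec_ocbi(struct toy_cpu *cpu, uint32_t addr)
{
    if (cpu->pending && ((cpu->pend_addr ^ addr) & LINE_MASK) == 0)
        cpu->pending = false;
    else
        commit_pending(cpu);
}

/* Any other memory-touching insn commits first, then executes. */
static uint32_t exec_load(struct toy_cpu *cpu, uint32_t addr)
{
    commit_pending(cpu);
    return cpu->mem[addr / 4];
}
```

In this sketch a movca immediately followed by ocbi on the same line
never reaches mem[], while a movca followed by a load behaves like an
ordinary store.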

> Another solution might be for linux to use an ocbp followed by an ocbi insn
> on the line. IIUC that should achieve the same net result.
I'm not sure about it.  But I think we should not modify linux,
because I now believe that current linux works on real silicon.

Thanks again!

Regards,
Shin-ichiro KAWASAKI
Edgar E. Iglesias Jan. 11, 2009, 10:42 a.m. UTC | #3
On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
>
> Edgar E. Iglesias wrote:
>> Another solution might be for linux to use an ocbp followed by an ocbi insn
>> on the line. IIUC that should achieve the same net result.
> I'm not sure about it.  But I think we should not modify linux,

I agree.

The ocbp followed by the ocbi that I suggested won't work and I can't
think of anything better than what linux is already doing.

Best regards
Paul Mundt Jan. 12, 2009, 12:58 p.m. UTC | #4
On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> Edgar E. Iglesias wrote:
> >One way to handle this particular cacheflush sequence might be to delay all
> >movca stores until there's another load/store or cache control insn being
> >issued to help you figure out if you can ignore previous movca. That will
> >not by any means cover all cases though.
> It seems a good way to avoid this problem.
> My current modification plan is as follows.
> - On executing 'movca', just record the store task which movca
>   should do into CPUStatus.
> - On executing 'ocbi', delete the store task.
> - Let TCG produce 'delayed_movca' instruction for
>   the first 'memory touching insn' or 'exception producing insn'
>   after movca.
> - On executing 'delayed_movca', do the store tasks. 
> 
There are other ways in which movca is used as well, including with
ocbi/ocbwb (SH-X2 and later can also do ocbp), as well as with the movca
line being operated on by mov later without any explicit manipulation of
the dcache (common behaviour in read paths).

If you are going to model it in CPUStatus, you are going to have to
effectively have something that spans the size of the cache and watch out
for accessors. Not all will be through the dcache modifier instructions,
remember that memory-mapped access is also used for flushing in other
areas.
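
To illustrate what "spanning the size of the cache" could mean for such
state, here is a minimal sketch with one slot per direct-mapped line.
The 32-byte line size (giving 512 lines for a 16KB cache) is an
assumption, and struct pending_table, record_movca() and find_pending()
are hypothetical names, not QEMU interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 5u                               /* assumed 32-byte lines */
#define NUM_LINES  ((16u * 1024u) >> LINE_SHIFT)    /* 512 lines for 16KB */

struct pending_store {
    bool     valid;
    uint32_t addr, val;
};

/* One slot per direct-mapped cache line. */
struct pending_table {
    struct pending_store slot[NUM_LINES];
};

/* Index of the cache line that an address maps to. */
static unsigned line_index(uint32_t addr)
{
    return (addr >> LINE_SHIFT) & (NUM_LINES - 1);
}

/*
 * Record a movca. Returns the entry previously held in the slot; if it is
 * valid, the caller must commit it, because it has been displaced.
 */
static struct pending_store record_movca(struct pending_table *t,
                                         uint32_t addr, uint32_t val)
{
    struct pending_store *s = &t->slot[line_index(addr)];
    struct pending_store old = *s;

    s->valid = true;
    s->addr = addr;
    s->val = val;
    return old;
}

/* Every accessor must check whether it touches a line with a pending store. */
static struct pending_store *find_pending(struct pending_table *t,
                                          uint32_t addr)
{
    struct pending_store *s = &t->slot[line_index(addr)];

    if (s->valid && (s->addr >> LINE_SHIFT) == (addr >> LINE_SHIFT))
        return s;
    return NULL;
}
```

Two addresses 16KB apart map to the same slot, so recording the second
displaces the first. Memory-mapped cache accesses would need their own
hook into this table; that part is not sketched here.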

> >Another solution might be for linux to use an ocbp followed by an ocbi insn
> >on the line. IIUC that should achieve the same net result.
> I'm not sure about it.  But I think we should not modify linux,
> because now I guess that the current linux works on real silicon. 
> 
Yes, we do not want to modify linux for this. Implementing real caches in
qemu is not going to be easy, but the kernel at least does have a
CONFIG_CACHE_OFF which you can select for qemu. If there is a page we can
test somewhere to figure out if we are under emulation we can likewise
just turn them off directly at boot.

Note that qemu also needs to be aware of the movca behavioural
differences between cache enabled and disabled.
Edgar E. Iglesias Jan. 13, 2009, 12:58 a.m. UTC | #5
On Mon, Jan 12, 2009 at 09:58:26PM +0900, Paul Mundt wrote:
> On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> > Edgar E. Iglesias wrote:
> > >One way to handle this particular cacheflush sequence might be to delay all
> > >movca stores until there's another load/store or cache control insn being
> > >issued to help you figure out if you can ignore previous movca. That will
> > >not by any means cover all cases though.
> > It seems a good way to avoid this problem.
> > My current modification plan is as follows.
> > - On executing 'movca', just record the store task which movca
> >   should do into CPUStatus.
> > - On executing 'ocbi', delete the store task.
> > - Let TCG produce 'delayed_movca' instruction for
> >   the first 'memory touching insn' or 'exception producing insn'
> >   after movca.
> > - On executing 'delayed_movca', do the store tasks. 
> > 
> There are other ways in which movca is used as well, including with
> ocbi/ocbwb (SH-X2 and later can also do ocbp), as well as with the movca
> line being operated on by mov later without any explicit manipulation of
> the dcache (common behaviour in read paths).

Yes but AFAICT the non ocbi cases are already kind of emulated (as good as
it gets without having cache models) both before and after the suggested
patch.

I think there might be a bit of confusion (maybe on my side). IIUC, the issue
is not really with the movca insn but more with the ocbi. Actually, I expect
any store immediately followed by ocbi to get canceled with the 7750
writeback cache policies (assuming no exceptions break the sequence). The
reason for using movca instead of mov when flushing the caches is only to
avoid the unnecessary line fetch in case the mov misses.

The tricky part for QEMU to emulate is the ocbi in writeback mode because it
means we suddenly might need to ignore previous stores. Luckily most of the
ocbi use-cases are not realistic (many even give unpredictable results) and
can IMO be left as unsupported features of QEMU. AFAICT, only the
movca + ocbi cache flush+invalidate sequences have important use-cases that
are currently not handled by QEMU. I don't really want to advocate for or
against a change, but if the sh folks think it's important to handle that
particular sequence it might be possible with something along the lines of
Shin-ichiro's patch.

> If you are going to model it in CPUStatus, you are going to have to
> effectively have something that spans the size of the cache and watch out
> for accessors. Not all will be through the dcache modifier instructions,
> remember that memory-mapped access is also used for flushing in other
> areas.

If I understand the docs correctly, the cache will always write back dirty
entries when being invalidated in this manner. I think QEMU can ignore these
because everything is already in memory by then.

This led me to think of another issue, the cacheability bit in the tlb
entries. But why would sw do a movca + ocbi to an uncached address?
It can probably be ignored, but the suggested patch will incorrectly cancel
those stores.

> > >Another solution might be for linux to use an ocbp followed by an ocbi insn
> > >on the line. IIUC that should achieve the same net result.
> > I'm not sure about it.  But I think we should not modify linux,
> > because now I guess that the current linux works on real silicon. 
> > 
> Yes, we do not want to modify linux for this. Implementing real caches in
> qemu is not going to be easy, but the kernel at least does have a
> CONFIG_CACHE_OFF which you can select for qemu. If there is a page we can
> test somewhere to figure out if we are under emulation we can likewise
> just turn them off directly at boot.

Does testing if atomic movca+ocbi stores leak to memory count? :)
No, but what if qemu hardwired the cache enable bits to always read
as zero in CCR?
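
The CCR idea could look roughly like this. It is a minimal sketch, not
QEMU code: struct toy_ccr and the accessors are hypothetical, and the
bit positions for OCE (bit 0) and ICE (bit 8) are given from memory of
the SH7750 register layout and should be checked against the manual.

```c
#include <stdint.h>

/* CCR enable bits as recalled from the SH7750 layout (please verify). */
#define CCR_OCE 0x0001u   /* operand cache enable */
#define CCR_ICE 0x0100u   /* instruction cache enable */

/* Hypothetical register model for the cache control register. */
struct toy_ccr {
    uint32_t raw;
};

/* The guest may write any value, including the enable bits. */
static void ccr_write(struct toy_ccr *r, uint32_t val)
{
    r->raw = val;
}

/* Reads never report the caches as enabled. */
static uint32_t ccr_read(const struct toy_ccr *r)
{
    return r->raw & ~(CCR_OCE | CCR_ICE);
}
```

A guest that reads CCR back after trying to enable the caches would then
see them as still disabled. Whether the kernel actually reacts to that
is not established in this thread.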

> Note that qemu also needs to be aware of the movca behavioural
> differences between cache enabled and disabled.

Good point. Caches in writethrough mode should maybe also be considered.

Sorry for the long reply...
Cheers
Edgar E. Iglesias Jan. 13, 2009, 2:57 a.m. UTC | #6
On Sun, Jan 11, 2009 at 11:42:59AM +0100, Edgar E. Iglesias wrote:
> On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> >
> > Edgar E. Iglesias wrote:
> >> Another solution might be for linux to use an ocbp followed by an ocbi insn
> >> on the line. IIUC that should achieve the same net result.
> > I'm not sure about it.  But I think we should not modify linux,
> 
> I agree.
> 
> The ocbp followed by the ocbi that I suggested won't work and I can't
> think of anything better than what linux is already doing.

I gave this some more thought and I think that there might be room for
improvements in the cache-sh4 flushing loops anyway.

The reason why the suggested ocbp+ocbi sequences didn't work was that
I later noticed that the same loop was being used for unconditional line
flushes as well as for flushes of ranges where you actually know the
virtual addresses you want to flush (complete page flushes seem to be
treated differently).

If you separate the two, the flushes for virtual ranges can be done with
the ocbp+ocbi sequence. If I'm not mistaken, there are two main
advantages:

1. You will flush and invalidate much fewer lines, only those that
   actually hit in the cache. This could turn out to be a pretty
   significant win.
2. You get rid of the atomic requirements. It means you can do all the
   ranged flushes with interrupts enabled all the time.
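
The first advantage can be put in numbers with a small sketch. It is
not kernel code: purge_range(), the purge_fn hook standing in for the
per-line ocbp+ocbi sequence, and the 32-byte line size are assumptions
made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE   32u            /* assumed cache line size */
#define DCACHE_SIZE (16u * 1024u)  /* SH7750 dcache size */

/* Hypothetical hook standing in for the per-line cache instructions. */
typedef void (*purge_fn)(uint32_t line_addr, void *ctx);

/*
 * Purge every line overlapping [start, end) and return the number of
 * lines visited. Assumes the range does not end at the very top of the
 * address space (the loop counter would wrap).
 */
static size_t purge_range(uint32_t start, uint32_t end,
                          purge_fn purge, void *ctx)
{
    size_t n = 0;

    for (uint32_t a = start & ~(LINE_SIZE - 1); a < end; a += LINE_SIZE) {
        purge(a, ctx);
        n++;
    }
    return n;
}

/* Example hook: only counts how often it is called. */
static void count_purge(uint32_t line_addr, void *ctx)
{
    (void)line_addr;
    ++*(size_t *)ctx;
}

/* Lines touched by an unconditional whole-cache loop, for comparison. */
static size_t full_flush_lines(void)
{
    return DCACHE_SIZE / LINE_SIZE;
}
```

Under these assumptions a 64-byte range that starts 4 bytes into a line
visits 3 lines, while the unconditional loop visits all 512.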

Best regards

Patch

diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 5cfe08d..4042c8c 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -612,6 +612,9 @@  static void __flush_dcache_segment_1way(unsigned long start,
 
 	a0 = base_addr;
 	a0e = base_addr + extent_per_way;
+	if (a0e > ((unsigned long)&empty_zero_page[0]) + PAGE_SIZE) {
+	    a0e = ((unsigned long)&empty_zero_page[0]) + PAGE_SIZE;
+	}
 	do {
 		asm volatile("ldc %0, sr" : : "r" (sr_with_bl));
 		asm volatile("movca.l r0, @%0\n\t"