Message ID | 4968DD28.3030709@juno.dti.ne.jp (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Paul Mundt |
On Sun, Jan 11, 2009 at 02:38:48AM +0900, Shin-ichiro KAWASAKI wrote:
> Hi, all.
>
> I'm now working on expanding qemu-sh to emulate the
> "Solution Engine 7750", and found one odd thing.
> Could you give me some advice?
>
> My SH7750 emulation environment fails to boot up.
> I did some investigation and found that:
> - The linux kernel for SE7750 (se7750_defconfig) flushes the
>   dcache in its boot sequence.
> - SH7750's dcache is 16KB and direct-mapped, so a 16KB memory
>   region is touched and modified to flush it.
> - empty_zero_page is used for this flush, but it is only 4KB.
>   The text region after it gets corrupted and causes the
>   boot failure.
>
> I attached a patch against the linux kernel to this mail for reference.
> It only reduces the flush region size to 4KB = PAGE_SIZE, but it avoids
> the problem and lets the kernel boot up cleanly.
> Of course it is not a good solution, because it does not flush the
> whole cache.
>
> I wonder about two points.
> - Does this problem happen on a real SE7750 board?

Hello,

I'm not very familiar with the sh arch, so please take this with a grain
of salt :)

It's not entirely clear to me whether the bug will show up on silicon, but
my guess is that it won't.

From my understanding of the docs, a movca store that misses in the cache
is processed with a write-validate write-miss policy. That means the movca
store will allocate the line (flushing any previous contents if needed)
but not fetch any data corresponding to the movca store address. The
SH7750 does not have multiple dirty bits per line, so that kind of
treatment leaves the unwritten parts of the line with unpredictable
contents.

Such insns can be very useful for fast block copies through writeback
caches that would otherwise do a line fetch on write misses.

So, when the ocbi insn invalidates the line, no write-back is done and the
downstream busses never see the movca store.

I'm not sure how to handle this in qemu without adding cache models.

One way to handle this particular cache-flush sequence might be to delay
all movca stores until another load/store or cache control insn is issued,
to help you figure out whether you can discard the pending movca. That
will not by any means cover all cases, though.

Another solution might be for linux to use an ocbp followed by an ocbi
insn on the line. IIUC that should achieve the same net results.

Cheers
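To make the failure mode concrete, here is a minimal sketch (C with SH-4
inline assembly, not the actual cache-sh4.c code) of the flush idiom being
discussed; the function name and constants are illustrative, assuming the
SH7750's 16KB one-way operand cache and 32-byte lines:

/*
 * Sketch of the movca.l + ocbi dcache flush idiom.  For each line of the
 * direct-mapped dcache, write to a scratch address that aliases the line
 * (movca.l allocates the line without a fetch, writing back any dirty
 * data that used to live there), then invalidate it with ocbi so the
 * scratch write itself is dropped.  The value stored is whatever happens
 * to be in r0; it never reaches memory on hardware.  The real loop in
 * cache-sh4.c also blocks exceptions (ldc of SR with BL set) around the
 * pair; that is omitted here for brevity.
 */
#define DCACHE_WAY_SIZE   (16 * 1024)   /* SH7750 operand cache, one way */
#define DCACHE_LINE_SIZE  32            /* assumed line size             */

static void flush_dcache_way(unsigned long scratch_base)
{
	unsigned long addr = scratch_base;
	unsigned long end  = scratch_base + DCACHE_WAY_SIZE;

	do {
		__asm__ __volatile__(
			"movca.l r0, @%0\n\t"   /* allocate line, no fetch      */
			"ocbi    @%0"           /* invalidate, no write-back    */
			: /* no outputs */
			: "r" (addr)
			: "memory");
		addr += DCACHE_LINE_SIZE;
	} while (addr < end);
}

On silicon the ocbi discards the line before anything is written back, so
the scratch region is never actually modified in memory; in qemu-sh
without a cache model the movca.l behaves like an ordinary store, which is
why the 16KB sweep tramples the text after the 4KB empty_zero_page.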
Edgar E. Iglesias wrote:
> On Sun, Jan 11, 2009 at 02:38:48AM +0900, Shin-ichiro KAWASAKI wrote:
>> Hi, all.
>>
>> I'm now working on expanding qemu-sh to emulate the
>> "Solution Engine 7750", and found one odd thing.
>> Could you give me some advice?
>>
>> My SH7750 emulation environment fails to boot up.
>> I did some investigation and found that:
>> - The linux kernel for SE7750 (se7750_defconfig) flushes the
>>   dcache in its boot sequence.
>> - SH7750's dcache is 16KB and direct-mapped, so a 16KB memory
>>   region is touched and modified to flush it.
>> - empty_zero_page is used for this flush, but it is only 4KB.
>>   The text region after it gets corrupted and causes the
>>   boot failure.
>>
>> I attached a patch against the linux kernel to this mail for reference.
>> It only reduces the flush region size to 4KB = PAGE_SIZE, but it avoids
>> the problem and lets the kernel boot up cleanly.
>> Of course it is not a good solution, because it does not flush the
>> whole cache.
>>
>> I wonder about two points.
>> - Does this problem happen on a real SE7750 board?
>
> Hello,
>
> I'm not very familiar with the sh arch, so please take this with a grain
> of salt :)
>
> It's not entirely clear to me whether the bug will show up on silicon, but
> my guess is that it won't.
>
> From my understanding of the docs, a movca store that misses in the cache
> is processed with a write-validate write-miss policy. That means the movca
> store will allocate the line (flushing any previous contents if needed)
> but not fetch any data corresponding to the movca store address. The
> SH7750 does not have multiple dirty bits per line, so that kind of
> treatment leaves the unwritten parts of the line with unpredictable
> contents.
>
> Such insns can be very useful for fast block copies through writeback
> caches that would otherwise do a line fetch on write misses.
>
> So, when the ocbi insn invalidates the line, no write-back is done and the
> downstream busses never see the movca store.

Thanks a lot!  This explains the situation.
I hadn't understood what movca does.

> I'm not sure how to handle this in qemu without adding cache models.

That seems like too big a task and might have a performance drawback.

> One way to handle this particular cache-flush sequence might be to delay
> all movca stores until another load/store or cache control insn is issued,
> to help you figure out whether you can discard the pending movca. That
> will not by any means cover all cases, though.

It seems a good way to avoid this problem.
My current modification plan is as follows (a rough sketch of this
bookkeeping follows below):

- On executing 'movca', just record the store task which movca
  should do into CPUStatus.
- On executing 'ocbi', delete the store task.
- Let TCG produce a 'delayed_movca' instruction for the first
  'memory touching insn' or 'exception producing insn' after movca.
- On executing 'delayed_movca', do the store tasks.

> Another solution might be for linux to use an ocbp followed by an ocbi
> insn on the line. IIUC that should achieve the same net results.

I'm not sure about that. But I think we should not modify linux, because
I now guess that the current linux works on real silicon.

Thanks again!

Regards,
Shin-ichiro KAWASAKI
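A rough sketch of the bookkeeping proposed in the plan above. Every name
here (PendingMovca, movca_record, and so on) is hypothetical, not an
existing QEMU identifier; the real implementation would hang this state
off the SH-4 CPU state and call the helpers from the translated
movca.l/ocbi and from the next memory-touching or exception-raising insn.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32  /* assumed SH7750 operand cache line size */

typedef struct PendingMovca {
    bool     valid;
    uint32_t addr;   /* target address of the delayed movca.l      */
    uint32_t value;  /* value of R0 at the time of the movca.l     */
} PendingMovca;

/* movca.l R0,@Rn: remember the store instead of performing it. */
static void movca_record(PendingMovca *p, uint32_t addr, uint32_t r0)
{
    p->valid = true;
    p->addr  = addr;
    p->value = r0;
}

/* ocbi @Rn: if it hits the pending line, the store is simply dropped. */
static void ocbi_cancel(PendingMovca *p, uint32_t addr)
{
    if (p->valid &&
        (p->addr & ~(uint32_t)(CACHE_LINE_SIZE - 1)) ==
        (addr    & ~(uint32_t)(CACHE_LINE_SIZE - 1)))
        p->valid = false;
}

/* Called before the next load/store or exception-raising insn. */
static void movca_flush(PendingMovca *p,
                        void (*store32)(uint32_t addr, uint32_t val))
{
    if (p->valid) {
        store32(p->addr, p->value);
        p->valid = false;
    }
}

The 'delayed_movca' op from the plan would then amount to calling
movca_flush() with the target's normal 32-bit store helper before the
first subsequent memory access or exception.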
On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
>
> Edgar E. Iglesias wrote:
>> Another solution might be for linux to use an ocbp followed by an ocbi
>> insn on the line. IIUC that should achieve the same net results.
> I'm not sure about that. But I think we should not modify linux,

I agree.

The ocbp followed by the ocbi that I suggested won't work, and I can't
think of anything better than what linux is already doing.

Best regards
On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> Edgar E. Iglesias wrote:
> > One way to handle this particular cache-flush sequence might be to delay
> > all movca stores until another load/store or cache control insn is issued,
> > to help you figure out whether you can discard the pending movca. That
> > will not by any means cover all cases, though.
> It seems a good way to avoid this problem.
> My current modification plan is as follows.
> - On executing 'movca', just record the store task which movca
>   should do into CPUStatus.
> - On executing 'ocbi', delete the store task.
> - Let TCG produce a 'delayed_movca' instruction for the first
>   'memory touching insn' or 'exception producing insn' after movca.
> - On executing 'delayed_movca', do the store tasks.

There are other ways in which movca is used as well, including with
ocbi/ocbwb (SH-X2 and later can also do ocbp), as well as the movca line
being operated on by a plain mov later without any explicit manipulation
of the dcache (common behaviour in read paths).

If you are going to model it in CPUStatus, you are effectively going to
have to have something that spans the size of the cache and watch out for
accessors. Not all of them will go through the dcache modifier
instructions; remember that memory-mapped access is also used for flushing
in other areas.

> > Another solution might be for linux to use an ocbp followed by an ocbi
> > insn on the line. IIUC that should achieve the same net results.
> I'm not sure about that. But I think we should not modify linux, because
> I now guess that the current linux works on real silicon.

Yes, we do not want to modify linux for this. Implementing real caches in
qemu is not going to be easy, but the kernel at least does have a
CONFIG_CACHE_OFF option which you can select for qemu. If there is a page
we can test somewhere to figure out whether we are running under
emulation, we can likewise just turn the caches off directly at boot.

Note that qemu also needs to be aware of the movca behavioural differences
between cache enabled and disabled.
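For scale, state that spans the whole cache, as described above, would
turn the single pending slot into a per-line table along these lines. The
sizes are assumptions (16KB direct-mapped dcache with 32-byte lines, i.e.
512 lines) and none of these names exist in QEMU:

#include <stdbool.h>
#include <stdint.h>

#define DCACHE_SIZE      (16 * 1024)
#define DCACHE_LINE_SIZE 32
#define DCACHE_LINES     (DCACHE_SIZE / DCACHE_LINE_SIZE)  /* 512 */

typedef struct PendingLine {
    bool     valid;
    uint32_t tag;                      /* line-aligned target address   */
    uint8_t  data[DCACHE_LINE_SIZE];   /* bytes written by movca.l      */
} PendingLine;

static PendingLine pending[DCACHE_LINES];

/* Direct-mapped: the line is selected by the low address bits. */
static inline unsigned line_index(uint32_t addr)
{
    return (addr / DCACHE_LINE_SIZE) % DCACHE_LINES;
}

Every guest load/store, plus the memory-mapped cache-array accesses
mentioned above, would have to consult this table, which is where most of
the cost and complexity would come from.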
On Mon, Jan 12, 2009 at 09:58:26PM +0900, Paul Mundt wrote:
> On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> > Edgar E. Iglesias wrote:
> > > One way to handle this particular cache-flush sequence might be to delay
> > > all movca stores until another load/store or cache control insn is issued,
> > > to help you figure out whether you can discard the pending movca. That
> > > will not by any means cover all cases, though.
> > It seems a good way to avoid this problem.
> > My current modification plan is as follows.
> > - On executing 'movca', just record the store task which movca
> >   should do into CPUStatus.
> > - On executing 'ocbi', delete the store task.
> > - Let TCG produce a 'delayed_movca' instruction for the first
> >   'memory touching insn' or 'exception producing insn' after movca.
> > - On executing 'delayed_movca', do the store tasks.
>
> There are other ways in which movca is used as well, including with
> ocbi/ocbwb (SH-X2 and later can also do ocbp), as well as the movca line
> being operated on by a plain mov later without any explicit manipulation
> of the dcache (common behaviour in read paths).

Yes, but AFAICT the non-ocbi cases are already kind of emulated (as good
as it gets without having cache models), both before and after the
suggested patch.

I think there might be a bit of confusion (maybe on my side). IIUC, the
issue is not really with the movca insn but rather with the ocbi.
Actually, I expect any store immediately followed by an ocbi to get
cancelled under the 7750 writeback cache policies (assuming no exceptions
break the sequence). The reason for using movca instead of mov when
flushing the caches is only to avoid the unnecessary line fetch in case
the mov misses.

The tricky part for QEMU to emulate is the ocbi in writeback mode, because
it means we suddenly might need to ignore previous stores. Luckily most of
the ocbi use-cases are not realistic (many even give unpredictable
results) and can IMO be left as unsupported features of QEMU. AFAICT, only
the movca + ocbi flush-and-invalidate sequences have important use-cases
that are currently not handled by QEMU.

I don't really want to advocate either way, but if the sh folks think it's
important to handle that particular sequence, it might be possible with
something along the lines of Shin-ichiro's patch.

> If you are going to model it in CPUStatus, you are effectively going to
> have to have something that spans the size of the cache and watch out for
> accessors. Not all of them will go through the dcache modifier
> instructions; remember that memory-mapped access is also used for flushing
> in other areas.

If I understand the docs correctly, the cache always writes back dirty
entries when it is invalidated in this manner, so I think QEMU can ignore
those write-backs because everything is already in memory by then.

This led me to think of another issue: the cacheability bit in the TLB
entries. Why would software do a movca + ocbi to an uncached address? It
can probably be ignored, but the suggested patch would incorrectly cancel
those stores.

> > > Another solution might be for linux to use an ocbp followed by an ocbi
> > > insn on the line. IIUC that should achieve the same net results.
> > I'm not sure about that. But I think we should not modify linux, because
> > I now guess that the current linux works on real silicon.
>
> Yes, we do not want to modify linux for this. Implementing real caches in
> qemu is not going to be easy, but the kernel at least does have a
> CONFIG_CACHE_OFF option which you can select for qemu. If there is a page
> we can test somewhere to figure out whether we are running under
> emulation, we can likewise just turn the caches off directly at boot.

Does testing whether atomic movca+ocbi stores leak to memory count? :)

No, but what if qemu were to hardwire the cache enable bits in CCR to
always read as zero?

> Note that qemu also needs to be aware of the movca behavioural differences
> between cache enabled and disabled.

Good point. Caches in writethrough mode should maybe also be considered.

Sorry for the long reply...

Cheers
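The CCR idea at the end could be as simple as masking the enable bits on
guest reads. This is a sketch only: the bit positions assume the SH7750
CCR layout (OCE at bit 0, ICE at bit 8), and the helper name is made up
for illustration, not an existing QEMU function.

#include <stdint.h>

#define CCR_OCE (1u << 0)   /* operand cache enable     */
#define CCR_ICE (1u << 8)   /* instruction cache enable */

/* Value returned to the guest when it reads CCR: caches always off. */
static uint32_t ccr_read_for_guest(uint32_t ccr_reg)
{
    return ccr_reg & ~(CCR_OCE | CCR_ICE);
}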
On Sun, Jan 11, 2009 at 11:42:59AM +0100, Edgar E. Iglesias wrote:
> On Sun, Jan 11, 2009 at 12:58:12PM +0900, Shin-ichiro KAWASAKI wrote:
> >
> > Edgar E. Iglesias wrote:
> >> Another solution might be for linux to use an ocbp followed by an ocbi
> >> insn on the line. IIUC that should achieve the same net results.
> > I'm not sure about that. But I think we should not modify linux,
>
> I agree.
>
> The ocbp followed by the ocbi that I suggested won't work, and I can't
> think of anything better than what linux is already doing.

I gave this some more thought, and I think there might be room for
improvement in the cache-sh4 flushing loops anyway.

The reason the suggested ocbp+ocbi sequence didn't work is that, as I
noticed later, the same loop is used both for unconditional line flushes
and for flushes of ranges where you actually know the virtual addresses
you want to flush (complete page flushes seem to be treated differently).

If you separate the two, the flushes for virtual ranges can be done with
the ocbp+ocbi sequence. If I'm not mistaken, there are two main
advantages:

1. You flush and invalidate far fewer lines, only those that actually hit
   in the cache. This could turn out to be a pretty significant win.

2. You get rid of the atomicity requirements. That means you can do all
   the ranged flushes with interrupts enabled the whole time.

Best regards
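A sketch of what the separated, ranged variant suggested above might look
like. This is illustrative only, not the existing cache-sh4.c code; the
function name is made up and the 32-byte line size is an assumption.

#define CACHE_LINE_SIZE 32  /* assumed SH7750 operand cache line size */

static void flush_dcache_virt_range(unsigned long start, unsigned long end)
{
	unsigned long v;

	start &= ~(unsigned long)(CACHE_LINE_SIZE - 1);
	for (v = start; v < end; v += CACHE_LINE_SIZE) {
		__asm__ __volatile__(
			"ocbp @%0\n\t"   /* write back if dirty, invalidate on hit */
			"ocbi @%0"       /* per the suggested ocbp+ocbi pairing    */
			: /* no outputs */
			: "r" (v)
			: "memory");
	}
}

Because there is no movca.l scratch store that has to stay paired with its
invalidate, nothing here needs to run with exceptions blocked, which is
advantage 2 above.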
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 5cfe08d..4042c8c 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -612,6 +612,9 @@ static void __flush_dcache_segment_1way(unsigned long start,
 	a0 = base_addr;
 	a0e = base_addr + extent_per_way;
+	if (a0e > ((unsigned long)&empty_zero_page[0]) + PAGE_SIZE) {
+		a0e = ((unsigned long)&empty_zero_page[0]) + PAGE_SIZE;
+	}
 	do {
 		asm volatile("ldc %0, sr" : : "r" (sr_with_bl));
 		asm volatile("movca.l r0, @%0\n\t"