Message ID | 20210116032740.873-5-thunder.leizhen@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ARM: Add support for Hisilicon Kunpeng L3 cache controller | expand |
On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: > diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile > + > +static void l3cache_maint_common(u32 range, u32 op_type) > +{ > + u32 reg; > + > + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); > + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); > + reg |= range | op_type; > + reg |= L3_MAINT_STATUS_START; > + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); Are there contents of L3_MAINT_CTRL that need to be preserved across calls and can not be inferred? A 'readl()' is often expensive, so it might be more efficient if you can avoid that. > +static inline void l3cache_flush_all_nolock(void) > +{ > + l3cache_maint_common(L3_MAINT_RANGE_ALL, L3_MAINT_TYPE_FLUSH); > +} > + > +static void l3cache_flush_all(void) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&l3cache_lock, flags); > + l3cache_flush_all_nolock(); > + spin_unlock_irqrestore(&l3cache_lock, flags); > +} I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of spin_lock_irqsave(), to avoid preemption in the middle of a cache operation. This is probably a good idea here as well. I also see that l2x0 uses readl_relaxed(), to avoid a deadlock in l2x0_cache_sync(). This may also be beneficial for performance reasons, so it might be helpful to compare performance overhead. On the other hand, readl()/writel() are usually the safe choice, as those avoid the need to argue over whether the relaxed versions are safe in all corner cases. > +static int __init l3cache_init(void) > +{ > + u32 reg; > + struct device_node *node; > + > + node = of_find_matching_node(NULL, l3cache_ids); > + if (!node) > + return -ENODEV; I think the initcall should return '0' to indicate success when running a kernel with this driver built-in on a platform that does not have this device. > diff --git a/arch/arm/mm/cache-kunpeng-l3.h b/arch/arm/mm/cache-kunpeng-l3.h > new file mode 100644 > index 000000000000000..9ef6a53e7d4db49 > --- /dev/null > +++ b/arch/arm/mm/cache-kunpeng-l3.h > @@ -0,0 +1,30 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef __CACHE_KUNPENG_L3_H > +#define __CACHE_KUNPENG_L3_H > + > +#define L3_CACHE_LINE_SHITF 6 I would suggest moving the contents of the header file into the .c file, since there is only a single user of these macros. Arnd
On 2021/1/28 22:24, Arnd Bergmann wrote: > On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: >> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile >> + >> +static void l3cache_maint_common(u32 range, u32 op_type) >> +{ >> + u32 reg; >> + >> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); >> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); >> + reg |= range | op_type; >> + reg |= L3_MAINT_STATUS_START; >> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); > > Are there contents of L3_MAINT_CTRL that need to be preserved > across calls and can not be inferred? A 'readl()' is often expensive, > so it might be more efficient if you can avoid that. Right, this readl() can be replaced with readl_relaxed(). Thanks. I'll check and correct the readl() and writel() in other places. > >> +static inline void l3cache_flush_all_nolock(void) >> +{ >> + l3cache_maint_common(L3_MAINT_RANGE_ALL, L3_MAINT_TYPE_FLUSH); >> +} >> + >> +static void l3cache_flush_all(void) >> +{ >> + unsigned long flags; >> + >> + spin_lock_irqsave(&l3cache_lock, flags); >> + l3cache_flush_all_nolock(); >> + spin_unlock_irqrestore(&l3cache_lock, flags); >> +} > > I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of > spin_lock_irqsave(), to avoid preemption in the middle of a cache > operation. This is probably a good idea here as well. I don't think there's any essential difference between the two! I don't know if the compiler or tool will do anything extra. I checked the git log of the l2x0 driver and it used raw_spin_lock_irqsave() at the beginning. Maybe there's a description in 2.6. Since you mentioned this potential risk, I'll change it to raw_spin_lock_irqsave. include/linux/spinlock.h: static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock) { return &lock->rlock; } #define spin_lock_irqsave(lock, flags) \ do { \ raw_spin_lock_irqsave(spinlock_check(lock), flags); \ } while (0) > > I also see that l2x0 uses readl_relaxed(), to avoid a deadlock > in l2x0_cache_sync(). This may also be beneficial for performance > reasons, so it might be helpful to compare performance > overhead. On the other hand, readl()/writel() are usually the > safe choice, as those avoid the need to argue over whether > the relaxed versions are safe in all corner cases. > >> +static int __init l3cache_init(void) >> +{ >> + u32 reg; >> + struct device_node *node; >> + >> + node = of_find_matching_node(NULL, l3cache_ids); >> + if (!node) >> + return -ENODEV; > > I think the initcall should return '0' to indicate success when running > a kernel with this driver built-in on a platform that does not have > this device. I have added "depends on ARCH_KUNPENG50X" for this driver. But it's OK to return 0. > >> diff --git a/arch/arm/mm/cache-kunpeng-l3.h b/arch/arm/mm/cache-kunpeng-l3.h >> new file mode 100644 >> index 000000000000000..9ef6a53e7d4db49 >> --- /dev/null >> +++ b/arch/arm/mm/cache-kunpeng-l3.h >> @@ -0,0 +1,30 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef __CACHE_KUNPENG_L3_H >> +#define __CACHE_KUNPENG_L3_H >> + >> +#define L3_CACHE_LINE_SHITF 6 > > I would suggest moving the contents of the header file into the .c file, > since there is only a single user of these macros. Okay, I'll move it. > > Arnd > > . >
On Fri, Jan 29, 2021 at 8:23 AM Leizhen (ThunderTown) <thunder.leizhen@huawei.com> wrote: > On 2021/1/28 22:24, Arnd Bergmann wrote: > > On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: > >> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile > >> + > >> +static void l3cache_maint_common(u32 range, u32 op_type) > >> +{ > >> + u32 reg; > >> + > >> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); > >> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); > >> + reg |= range | op_type; > >> + reg |= L3_MAINT_STATUS_START; > >> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); > > > > Are there contents of L3_MAINT_CTRL that need to be preserved > > across calls and can not be inferred? A 'readl()' is often expensive, > > so it might be more efficient if you can avoid that. > > Right, this readl() can be replaced with readl_relaxed(). Thanks. > > I'll check and correct the readl() and writel() in other places. What I meant is that if you want to replace them, you should provide performance numbers that show how much difference this makes and add comments in the source code explaining how you proved that the _relaxed() version is actually correct. > >> +static inline void l3cache_flush_all_nolock(void) > >> +{ > >> + l3cache_maint_common(L3_MAINT_RANGE_ALL, L3_MAINT_TYPE_FLUSH); > >> +} > >> + > >> +static void l3cache_flush_all(void) > >> +{ > >> + unsigned long flags; > >> + > >> + spin_lock_irqsave(&l3cache_lock, flags); > >> + l3cache_flush_all_nolock(); > >> + spin_unlock_irqrestore(&l3cache_lock, flags); > >> +} > > > > I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of > > spin_lock_irqsave(), to avoid preemption in the middle of a cache > > operation. This is probably a good idea here as well. > > I don't think there's any essential difference between the two! I don't know > if the compiler or tool will do anything extra. I checked the git log of the > l2x0 driver and it used raw_spin_lock_irqsave() at the beginning. Maybe > there's a description in 2.6. Since you mentioned this potential risk, I'll > change it to raw_spin_lock_irqsave. > include/linux/spinlock.h: > static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock) > { > return &lock->rlock; > } > > #define spin_lock_irqsave(lock, flags) \ > do { \ > raw_spin_lock_irqsave(spinlock_check(lock), flags); \ > } while (0) The spin_lock_irqsave() definition is one of the things that differs with CONFIG_PREEMPT_RT=y, where it uses a mutex instead. See https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-5.11.y-rt-rebase for the final missing patches including the one that changes the spinlock definition and some patches that change a few other spin_lock to raw_spin_lock. > >> +static int __init l3cache_init(void) > >> +{ > >> + u32 reg; > >> + struct device_node *node; > >> + > >> + node = of_find_matching_node(NULL, l3cache_ids); > >> + if (!node) > >> + return -ENODEV; > > > > I think the initcall should return '0' to indicate success when running > > a kernel with this driver built-in on a platform that does not have > > this device. > > I have added "depends on ARCH_KUNPENG50X" for this driver. But it's OK to > return 0. Note that the "depends on ARCH_KUNPENG50X" is not relevant here, since it only prevents you from enabling the driver on kernels that explicitly exclude the kunpeng platform, but it has no significance to what you are actually running on. The "multi_v7_defconfig" file should have all actively maintained armv7 platforms enabled, similar to how common distros ship their kernels. Arnd
On Fri, Jan 29, 2021 at 03:23:27PM +0800, Leizhen (ThunderTown) wrote: > On 2021/1/28 22:24, Arnd Bergmann wrote: > > I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of > > spin_lock_irqsave(), to avoid preemption in the middle of a cache > > operation. This is probably a good idea here as well. > > I don't think there's any essential difference between the two! I don't know > if the compiler or tool will do anything extra. I checked the git log of the > l2x0 driver and it used raw_spin_lock_irqsave() at the beginning. Maybe > there's a description in 2.6. Since you mentioned this potential risk, I'll > change it to raw_spin_lock_irqsave. See bd31b85960a7 ("locking, ARM: Annotate low level hw locks as raw")
On Fri, Jan 29, 2021 at 9:16 AM Arnd Bergmann <arnd@kernel.org> wrote: > On Fri, Jan 29, 2021 at 8:23 AM Leizhen (ThunderTown) > <thunder.leizhen@huawei.com> wrote: > > On 2021/1/28 22:24, Arnd Bergmann wrote: > > > On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: > > >> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile > > >> + > > >> +static void l3cache_maint_common(u32 range, u32 op_type) > > >> +{ > > >> + u32 reg; > > >> + > > >> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); > > >> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); > > >> + reg |= range | op_type; > > >> + reg |= L3_MAINT_STATUS_START; > > >> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); > > > > > > Are there contents of L3_MAINT_CTRL that need to be preserved > > > across calls and can not be inferred? A 'readl()' is often expensive, > > > so it might be more efficient if you can avoid that. > > > > Right, this readl() can be replaced with readl_relaxed(). Thanks. > > > > I'll check and correct the readl() and writel() in other places. > > What I meant is that if you want to replace them, you should provide > performance numbers that show how much difference this makes > and add comments in the source code explaining how you proved that > the _relaxed() version is actually correct. Another clarification, as there are actually two independent points here: * if you can completely remove the readl() above and just write a hardcoded value into the register, or perhaps read the original value once at boot time, that is probably a win because it avoids one of the barriers in the beginning. The datasheet should tell you if there are any bits in the register that have to be preserved * Regarding the _relaxed() accessors, it's a lot harder to know whether that is safe, as you first have to show, in particular in case any of the accesses stop being guarded by the spinlock in that case, and whether there may be a case where you have to serialize the memory access against accesses that are still in the store queue or prefetched. Whether this matters at all depends mostly on the type of devices you are driving on your SoC. If you have any high-speed network interfaces that are unable to do cache coherent DMA, any extra instruction here may impact the number of packets you can transfer, but if all your high-speed devices are connected to a coherent interconnect, I would just go with the obvious approach and use the safe MMIO accessors everywhere. Arnd
On Fri, Jan 29, 2021 at 11:26:38AM +0100, Arnd Bergmann wrote: > Another clarification, as there are actually two independent > points here: > > * if you can completely remove the readl() above and just write a > hardcoded value into the register, or perhaps read the original > value once at boot time, that is probably a win because it > avoids one of the barriers in the beginning. The datasheet should > tell you if there are any bits in the register that have to be > preserved > > * Regarding the _relaxed() accessors, it's a lot harder to know > whether that is safe, as you first have to show, in particular in case > any of the accesses stop being guarded by the spinlock in that > case, and whether there may be a case where you have to > serialize the memory access against accesses that are still in the > store queue or prefetched. > > Whether this matters at all depends mostly on the type of devices > you are driving on your SoC. If you have any high-speed network > interfaces that are unable to do cache coherent DMA, any extra > instruction here may impact the number of packets you can transfer, > but if all your high-speed devices are connected to a coherent > interconnect, I would just go with the obvious approach and use > the safe MMIO accessors everywhere. For L2 cache code, I would say the opposite, actually, because it is all too easy to get into a deadlock otherwise. If you implement the sync callback, that will be called from every non-relaxed accessor, which means if you need to take some kind of lock in the sync callback and elsewhere in the L2 cache code, you will definitely deadlock. It is safer to put explicit barriers where it is necessary. Also remember that the barrier in readl() etc is _after_ the read, not before, and the barrier in writel() is _before_ the write, not after. The point is to ensure that DMA memory accesses are properly ordered with the IO-accessing instructions. So, using readl_relaxed() with a read-modify-write is entirely sensible provided you do not access DMA memory inbetween.
On Fri, Jan 29, 2021 at 11:33 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Fri, Jan 29, 2021 at 11:26:38AM +0100, Arnd Bergmann wrote: > > Another clarification, as there are actually two independent > > points here: > > > > * if you can completely remove the readl() above and just write a > > hardcoded value into the register, or perhaps read the original > > value once at boot time, that is probably a win because it > > avoids one of the barriers in the beginning. The datasheet should > > tell you if there are any bits in the register that have to be > > preserved > > > > * Regarding the _relaxed() accessors, it's a lot harder to know > > whether that is safe, as you first have to show, in particular in case > > any of the accesses stop being guarded by the spinlock in that > > case, and whether there may be a case where you have to > > serialize the memory access against accesses that are still in the > > store queue or prefetched. > > > > Whether this matters at all depends mostly on the type of devices > > you are driving on your SoC. If you have any high-speed network > > interfaces that are unable to do cache coherent DMA, any extra > > instruction here may impact the number of packets you can transfer, > > but if all your high-speed devices are connected to a coherent > > interconnect, I would just go with the obvious approach and use > > the safe MMIO accessors everywhere. > > For L2 cache code, I would say the opposite, actually, because it is > all too easy to get into a deadlock otherwise. > > If you implement the sync callback, that will be called from every > non-relaxed accessor, which means if you need to take some kind of > lock in the sync callback and elsewhere in the L2 cache code, you will > definitely deadlock. Fair enough. I mentioned the sync callback as the reason for using the relaxed accessor in l2x0 in my first reply. Clearly if there was a sync callback here, it would immediately deadlock when calling back into sync() from readl()/writel(). > It is safer to put explicit barriers where it is necessary. > > Also remember that the barrier in readl() etc is _after_ the read, not > before, and the barrier in writel() is _before_ the write, not after. > The point is to ensure that DMA memory accesses are properly ordered > with the IO-accessing instructions. > > So, using readl_relaxed() with a read-modify-write is entirely sensible > provided you do not access DMA memory inbetween. The part I was not sure about is what happens when you have a store to memory immediately before flushing the cache, and there are no barriers inbetween. Is there a possibility for the mmio store to cause the cache to be flushed before the prior memory store has made it into the cache? My guess would be that this cannot happen, but I'm not sure. If the code gets changed to raw_writel(), I think this should be documented next to the actual raw_writel(), explaining either the presence of the absence of such a barrier. Arnd
On Fri, Jan 29, 2021 at 11:53:20AM +0100, Arnd Bergmann wrote: > On Fri, Jan 29, 2021 at 11:33 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > It is safer to put explicit barriers where it is necessary. > > > > Also remember that the barrier in readl() etc is _after_ the read, not > > before, and the barrier in writel() is _before_ the write, not after. > > The point is to ensure that DMA memory accesses are properly ordered > > with the IO-accessing instructions. > > > > So, using readl_relaxed() with a read-modify-write is entirely sensible > > provided you do not access DMA memory inbetween. > > The part I was not sure about is what happens when you have > a store to memory immediately before flushing the cache, and there > are no barriers inbetween. If the caches are non-coherent, we have to flush L1 before we flush L2 to ensure writebacks get pushed out properly, and L1 will already have the necessary barriers. If we have the situation where L1 is coherent but L2 isn't, then I think we have an "interesting situation" that we haven't considered whether it be in DT or elsewhere.
On 2021/1/29 16:16, Arnd Bergmann wrote: > On Fri, Jan 29, 2021 at 8:23 AM Leizhen (ThunderTown) > <thunder.leizhen@huawei.com> wrote: >> On 2021/1/28 22:24, Arnd Bergmann wrote: >>> On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: >>>> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile >>>> + >>>> +static void l3cache_maint_common(u32 range, u32 op_type) >>>> +{ >>>> + u32 reg; >>>> + >>>> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); >>>> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); >>>> + reg |= range | op_type; >>>> + reg |= L3_MAINT_STATUS_START; >>>> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); >>> >>> Are there contents of L3_MAINT_CTRL that need to be preserved >>> across calls and can not be inferred? A 'readl()' is often expensive, >>> so it might be more efficient if you can avoid that. >> >> Right, this readl() can be replaced with readl_relaxed(). Thanks. >> >> I'll check and correct the readl() and writel() in other places. > > What I meant is that if you want to replace them, you should provide > performance numbers that show how much difference this makes > and add comments in the source code explaining how you proved that > the _relaxed() version is actually correct. Yes, it's just a theoretical analysis now. After the modification, I'll test it. But then again, outcache operations are not critical path of performance. > >>>> +static inline void l3cache_flush_all_nolock(void) >>>> +{ >>>> + l3cache_maint_common(L3_MAINT_RANGE_ALL, L3_MAINT_TYPE_FLUSH); >>>> +} >>>> + >>>> +static void l3cache_flush_all(void) >>>> +{ >>>> + unsigned long flags; >>>> + >>>> + spin_lock_irqsave(&l3cache_lock, flags); >>>> + l3cache_flush_all_nolock(); >>>> + spin_unlock_irqrestore(&l3cache_lock, flags); >>>> +} >>> >>> I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of >>> spin_lock_irqsave(), to avoid preemption in the middle of a cache >>> operation. This is probably a good idea here as well. >> >> I don't think there's any essential difference between the two! I don't know >> if the compiler or tool will do anything extra. I checked the git log of the >> l2x0 driver and it used raw_spin_lock_irqsave() at the beginning. Maybe >> there's a description in 2.6. Since you mentioned this potential risk, I'll >> change it to raw_spin_lock_irqsave. > >> include/linux/spinlock.h: >> static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock) >> { >> return &lock->rlock; >> } >> >> #define spin_lock_irqsave(lock, flags) \ >> do { \ >> raw_spin_lock_irqsave(spinlock_check(lock), flags); \ >> } while (0) > > The spin_lock_irqsave() definition is one of the things that differs > with CONFIG_PREEMPT_RT=y, where it uses a mutex instead. > > See https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-5.11.y-rt-rebase > for the final missing patches including the one that changes the > spinlock definition and some patches that change a few other spin_lock > to raw_spin_lock. OK, thanks. > >>>> +static int __init l3cache_init(void) >>>> +{ >>>> + u32 reg; >>>> + struct device_node *node; >>>> + >>>> + node = of_find_matching_node(NULL, l3cache_ids); >>>> + if (!node) >>>> + return -ENODEV; >>> >>> I think the initcall should return '0' to indicate success when running >>> a kernel with this driver built-in on a platform that does not have >>> this device. >> >> I have added "depends on ARCH_KUNPENG50X" for this driver. But it's OK to >> return 0. > > Note that the "depends on ARCH_KUNPENG50X" is not relevant here, since > it only prevents you from enabling the driver on kernels that explicitly exclude > the kunpeng platform, but it has no significance to what you are actually > running on. The "multi_v7_defconfig" file should have all actively maintained > armv7 platforms enabled, similar to how common distros ship their kernels. OK, I got it. > > Arnd > > . >
On 2021/1/29 18:12, Russell King - ARM Linux admin wrote: > On Fri, Jan 29, 2021 at 03:23:27PM +0800, Leizhen (ThunderTown) wrote: >> On 2021/1/28 22:24, Arnd Bergmann wrote: >>> I see that cache-l2x0 uses raw_spin_lock_irqsave() instead of >>> spin_lock_irqsave(), to avoid preemption in the middle of a cache >>> operation. This is probably a good idea here as well. >> >> I don't think there's any essential difference between the two! I don't know >> if the compiler or tool will do anything extra. I checked the git log of the >> l2x0 driver and it used raw_spin_lock_irqsave() at the beginning. Maybe >> there's a description in 2.6. Since you mentioned this potential risk, I'll >> change it to raw_spin_lock_irqsave. > > See bd31b85960a7 ("locking, ARM: Annotate low level hw locks as raw") OK, thanks. >
On 2021/1/29 18:26, Arnd Bergmann wrote: > On Fri, Jan 29, 2021 at 9:16 AM Arnd Bergmann <arnd@kernel.org> wrote: >> On Fri, Jan 29, 2021 at 8:23 AM Leizhen (ThunderTown) >> <thunder.leizhen@huawei.com> wrote: >>> On 2021/1/28 22:24, Arnd Bergmann wrote: >>>> On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: >>>>> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile >>>>> + >>>>> +static void l3cache_maint_common(u32 range, u32 op_type) >>>>> +{ >>>>> + u32 reg; >>>>> + >>>>> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); >>>>> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); >>>>> + reg |= range | op_type; >>>>> + reg |= L3_MAINT_STATUS_START; >>>>> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); >>>> >>>> Are there contents of L3_MAINT_CTRL that need to be preserved >>>> across calls and can not be inferred? A 'readl()' is often expensive, >>>> so it might be more efficient if you can avoid that. >>> >>> Right, this readl() can be replaced with readl_relaxed(). Thanks. >>> >>> I'll check and correct the readl() and writel() in other places. >> >> What I meant is that if you want to replace them, you should provide >> performance numbers that show how much difference this makes >> and add comments in the source code explaining how you proved that >> the _relaxed() version is actually correct. > > Another clarification, as there are actually two independent > points here: > > * if you can completely remove the readl() above and just write a > hardcoded value into the register, or perhaps read the original > value once at boot time, that is probably a win because it > avoids one of the barriers in the beginning. The datasheet should > tell you if there are any bits in the register that have to be > preserved Code coupling will become very strong. > > * Regarding the _relaxed() accessors, it's a lot harder to know > whether that is safe, as you first have to show, in particular in case > any of the accesses stop being guarded by the spinlock in that > case, and whether there may be a case where you have to > serialize the memory access against accesses that are still in the > store queue or prefetched. > > Whether this matters at all depends mostly on the type of devices > you are driving on your SoC. If you have any high-speed network > interfaces that are unable to do cache coherent DMA, any extra > instruction here may impact the number of packets you can transfer, > but if all your high-speed devices are connected to a coherent > interconnect, I would just go with the obvious approach and use > the safe MMIO accessors everywhere. In fact, this driver has been running on an earlier version for several years and has not received any feedback about the performance issue. So I didn't try to optimize it when I first sent these patches. I had to reconsider it until you noticed it. How about keeping it unchanged for the moment? It'll take a lot of time and energy to retest. > > Arnd > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > >
On 2021/1/29 18:33, Russell King - ARM Linux admin wrote: > On Fri, Jan 29, 2021 at 11:26:38AM +0100, Arnd Bergmann wrote: >> Another clarification, as there are actually two independent >> points here: >> >> * if you can completely remove the readl() above and just write a >> hardcoded value into the register, or perhaps read the original >> value once at boot time, that is probably a win because it >> avoids one of the barriers in the beginning. The datasheet should >> tell you if there are any bits in the register that have to be >> preserved >> >> * Regarding the _relaxed() accessors, it's a lot harder to know >> whether that is safe, as you first have to show, in particular in case >> any of the accesses stop being guarded by the spinlock in that >> case, and whether there may be a case where you have to >> serialize the memory access against accesses that are still in the >> store queue or prefetched. >> >> Whether this matters at all depends mostly on the type of devices >> you are driving on your SoC. If you have any high-speed network >> interfaces that are unable to do cache coherent DMA, any extra >> instruction here may impact the number of packets you can transfer, >> but if all your high-speed devices are connected to a coherent >> interconnect, I would just go with the obvious approach and use >> the safe MMIO accessors everywhere. > > For L2 cache code, I would say the opposite, actually, because it is > all too easy to get into a deadlock otherwise. > > If you implement the sync callback, that will be called from every > non-relaxed accessor, which means if you need to take some kind of > lock in the sync callback and elsewhere in the L2 cache code, you will > definitely deadlock. > > It is safer to put explicit barriers where it is necessary. > > Also remember that the barrier in readl() etc is _after_ the read, not > before, and the barrier in writel() is _before_ the write, not after. > The point is to ensure that DMA memory accesses are properly ordered > with the IO-accessing instructions. Yes, I known it. writel() must be used for the write operations that control "start/stop" or "enable/disable" function, to ensure that the data of previous write operations reaches the target. I've met this kind of problem before. > > So, using readl_relaxed() with a read-modify-write is entirely sensible > provided you do not access DMA memory inbetween. Actually, I don't think this register is that complicated. I copied the code back below. All the bits of L3_MAINT_CTRL are not affected by DMA access operations. The software change the "range | op_type" to specify the operation type and scope, the set the bit "L3_MAINT_STATUS_START" to start the operation. Then wait for that bit to change from 1 to 0 by hardware. + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); + reg |= range | op_type; + reg |= L3_MAINT_STATUS_START; + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); + + /* Wait until the hardware maintenance operation is complete. */ + do { + cpu_relax(); + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); + } while ((reg & L3_MAINT_STATUS_MASK) != L3_MAINT_STATUS_END); >
On 2021/1/29 21:54, Leizhen (ThunderTown) wrote: > > > On 2021/1/29 18:26, Arnd Bergmann wrote: >> On Fri, Jan 29, 2021 at 9:16 AM Arnd Bergmann <arnd@kernel.org> wrote: >>> On Fri, Jan 29, 2021 at 8:23 AM Leizhen (ThunderTown) >>> <thunder.leizhen@huawei.com> wrote: >>>> On 2021/1/28 22:24, Arnd Bergmann wrote: >>>>> On Sat, Jan 16, 2021 at 4:27 AM Zhen Lei <thunder.leizhen@huawei.com> wrote: >>>>>> diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile >>>>>> + >>>>>> +static void l3cache_maint_common(u32 range, u32 op_type) >>>>>> +{ >>>>>> + u32 reg; >>>>>> + >>>>>> + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); >>>>>> + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); >>>>>> + reg |= range | op_type; >>>>>> + reg |= L3_MAINT_STATUS_START; >>>>>> + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); >>>>> >>>>> Are there contents of L3_MAINT_CTRL that need to be preserved >>>>> across calls and can not be inferred? A 'readl()' is often expensive, >>>>> so it might be more efficient if you can avoid that. >>>> >>>> Right, this readl() can be replaced with readl_relaxed(). Thanks. >>>> >>>> I'll check and correct the readl() and writel() in other places. >>> >>> What I meant is that if you want to replace them, you should provide >>> performance numbers that show how much difference this makes >>> and add comments in the source code explaining how you proved that >>> the _relaxed() version is actually correct. >> >> Another clarification, as there are actually two independent >> points here: >> >> * if you can completely remove the readl() above and just write a >> hardcoded value into the register, or perhaps read the original >> value once at boot time, that is probably a win because it >> avoids one of the barriers in the beginning. The datasheet should >> tell you if there are any bits in the register that have to be >> preserved > > Code coupling will become very strong. > >> >> * Regarding the _relaxed() accessors, it's a lot harder to know >> whether that is safe, as you first have to show, in particular in case >> any of the accesses stop being guarded by the spinlock in that >> case, and whether there may be a case where you have to >> serialize the memory access against accesses that are still in the >> store queue or prefetched. >> >> Whether this matters at all depends mostly on the type of devices >> you are driving on your SoC. If you have any high-speed network >> interfaces that are unable to do cache coherent DMA, any extra >> instruction here may impact the number of packets you can transfer, >> but if all your high-speed devices are connected to a coherent >> interconnect, I would just go with the obvious approach and use >> the safe MMIO accessors everywhere. > > In fact, this driver has been running on an earlier version for several years > and has not received any feedback about the performance issue. So I didn't > try to optimize it when I first sent these patches. I had to reconsider it > until you noticed it. > > How about keeping it unchanged for the moment? It'll take a lot of time and > energy to retest. In the spirit of code excellence, it's still necessary to optimize it. Yesterday, my family urged me to go back, I wrote it in a hurry. > >> >> Arnd >> >> _______________________________________________ >> linux-arm-kernel mailing list >> linux-arm-kernel@lists.infradead.org >> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel >> >> > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > > . >
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index 02692fbe2db5c59..d2082503de053d2 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -1070,6 +1070,16 @@ config CACHE_XSC3L2 help This option enables the L2 cache on XScale3. +config CACHE_KUNPENG_L3 + bool "Enable the Hisilicon Kunpeng L3 cache controller" + depends on ARCH_KUNPENG50X && OF + default y + select OUTER_CACHE + help + This option enables the Kunpeng L3 cache controller on Hisilicon + Kunpeng506 and Kunpeng509 SoCs. It supports a maximum of 36-bit + physical addresses. + config ARM_L1_CACHE_SHIFT_6 bool default y if CPU_V7 diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 3510503bc5e688b..ececc5489e353eb 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -112,6 +112,7 @@ obj-$(CONFIG_CACHE_L2X0_PMU) += cache-l2x0-pmu.o obj-$(CONFIG_CACHE_XSC3L2) += cache-xsc3l2.o obj-$(CONFIG_CACHE_TAUROS2) += cache-tauros2.o obj-$(CONFIG_CACHE_UNIPHIER) += cache-uniphier.o +obj-$(CONFIG_CACHE_KUNPENG_L3) += cache-kunpeng-l3.o KASAN_SANITIZE_kasan_init.o := n obj-$(CONFIG_KASAN) += kasan_init.o diff --git a/arch/arm/mm/cache-kunpeng-l3.c b/arch/arm/mm/cache-kunpeng-l3.c new file mode 100644 index 000000000000000..cb81f15d26a0cf2 --- /dev/null +++ b/arch/arm/mm/cache-kunpeng-l3.c @@ -0,0 +1,153 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2021 Hisilicon Limited. + */ + +#include <linux/init.h> +#include <linux/spinlock.h> +#include <linux/io.h> +#include <linux/of_address.h> + +#include <asm/cacheflush.h> + +#include "cache-kunpeng-l3.h" + +static DEFINE_SPINLOCK(l3cache_lock); +static void __iomem *l3_ctrl_base; + + +static void l3cache_maint_common(u32 range, u32 op_type) +{ + u32 reg; + + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); + reg &= ~(L3_MAINT_RANGE_MASK | L3_MAINT_TYPE_MASK); + reg |= range | op_type; + reg |= L3_MAINT_STATUS_START; + writel(reg, l3_ctrl_base + L3_MAINT_CTRL); + + /* Wait until the hardware maintenance operation is complete. */ + do { + cpu_relax(); + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); + } while ((reg & L3_MAINT_STATUS_MASK) != L3_MAINT_STATUS_END); +} + +static void l3cache_maint_range(phys_addr_t start, phys_addr_t end, u32 op_type) +{ + start = start >> L3_CACHE_LINE_SHITF; + end = ((end - 1) >> L3_CACHE_LINE_SHITF) + 1; + + writel(start, l3_ctrl_base + L3_MAINT_START); + writel(end, l3_ctrl_base + L3_MAINT_END); + + l3cache_maint_common(L3_MAINT_RANGE_ADDR, op_type); +} + +static inline void l3cache_flush_all_nolock(void) +{ + l3cache_maint_common(L3_MAINT_RANGE_ALL, L3_MAINT_TYPE_FLUSH); +} + +static void l3cache_flush_all(void) +{ + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + l3cache_flush_all_nolock(); + spin_unlock_irqrestore(&l3cache_lock, flags); +} + +static void l3cache_inv_range(phys_addr_t start, phys_addr_t end) +{ + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + l3cache_maint_range(start, end, L3_MAINT_TYPE_INV); + spin_unlock_irqrestore(&l3cache_lock, flags); +} + +static void l3cache_clean_range(phys_addr_t start, phys_addr_t end) +{ + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + l3cache_maint_range(start, end, L3_MAINT_TYPE_CLEAN); + spin_unlock_irqrestore(&l3cache_lock, flags); +} + +static void l3cache_flush_range(phys_addr_t start, phys_addr_t end) +{ + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + l3cache_maint_range(start, end, L3_MAINT_TYPE_FLUSH); + spin_unlock_irqrestore(&l3cache_lock, flags); +} + +static void l3cache_disable(void) +{ + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + l3cache_flush_all_nolock(); + writel(L3_CTRL_DISABLE, l3_ctrl_base + L3_CTRL); + spin_unlock_irqrestore(&l3cache_lock, flags); +} + +static const struct of_device_id l3cache_ids[] __initconst = { + {.compatible = "hisilicon,kunpeng-l3cache", .data = NULL}, + {} +}; + +static int __init l3cache_init(void) +{ + u32 reg; + struct device_node *node; + + node = of_find_matching_node(NULL, l3cache_ids); + if (!node) + return -ENODEV; + + l3_ctrl_base = of_iomap(node, 0); + if (!l3_ctrl_base) { + pr_err("failed to map Kunpeng L3 cache controller registers\n"); + return -ENOMEM; + } + + reg = readl(l3_ctrl_base + L3_CTRL); + if (!(reg & L3_CTRL_ENABLE)) { + unsigned long flags; + + spin_lock_irqsave(&l3cache_lock, flags); + + /* + * Ensure that no L3 cache hardware maintenance operations are + * being performed before enabling the L3 cache. Wait for it to + * finish. + */ + do { + cpu_relax(); + reg = readl(l3_ctrl_base + L3_MAINT_CTRL); + } while ((reg & L3_MAINT_STATUS_MASK) != L3_MAINT_STATUS_END); + + reg = readl(l3_ctrl_base + L3_AUCTRL); + reg |= L3_AUCTRL_EVENT_EN | L3_AUCTRL_ECC_EN; + writel(reg, l3_ctrl_base + L3_AUCTRL); + + writel(L3_CTRL_ENABLE, l3_ctrl_base + L3_CTRL); + + spin_unlock_irqrestore(&l3cache_lock, flags); + } + + outer_cache.inv_range = l3cache_inv_range; + outer_cache.clean_range = l3cache_clean_range; + outer_cache.flush_range = l3cache_flush_range; + outer_cache.flush_all = l3cache_flush_all; + outer_cache.disable = l3cache_disable; + + pr_info("Hisilicon Kunpeng L3 cache controller enabled\n"); + + return 0; +} +arch_initcall(l3cache_init); diff --git a/arch/arm/mm/cache-kunpeng-l3.h b/arch/arm/mm/cache-kunpeng-l3.h new file mode 100644 index 000000000000000..9ef6a53e7d4db49 --- /dev/null +++ b/arch/arm/mm/cache-kunpeng-l3.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __CACHE_KUNPENG_L3_H +#define __CACHE_KUNPENG_L3_H + +#define L3_CACHE_LINE_SHITF 6 + +#define L3_CTRL 0x0 +#define L3_CTRL_ENABLE (1U << 0) +#define L3_CTRL_DISABLE (0U << 0) + +#define L3_AUCTRL 0x4 +#define L3_AUCTRL_EVENT_EN BIT(23) +#define L3_AUCTRL_ECC_EN BIT(8) + +#define L3_MAINT_CTRL 0x20 +#define L3_MAINT_RANGE_MASK GENMASK(3, 3) +#define L3_MAINT_RANGE_ALL (0U << 3) +#define L3_MAINT_RANGE_ADDR (1U << 3) +#define L3_MAINT_TYPE_MASK GENMASK(2, 1) +#define L3_MAINT_TYPE_CLEAN (1U << 1) +#define L3_MAINT_TYPE_INV (2U << 1) +#define L3_MAINT_TYPE_FLUSH (3U << 1) +#define L3_MAINT_STATUS_MASK GENMASK(0, 0) +#define L3_MAINT_STATUS_START (1U << 0) +#define L3_MAINT_STATUS_END (0U << 0) + +#define L3_MAINT_START 0x24 +#define L3_MAINT_END 0x28 + +#endif
Add support for the Hisilicon Kunpeng L3 cache controller as used with Kunpeng506 and Kunpeng509 SoCs. These Hisilicon SoCs support LPAE, so the physical addresses is wider than 32-bits, but the actual bit width does not exceed 36 bits. When the cache operation is performed based on the address range, the upper 30 bits of the physical address are recorded in registers L3_MAINT_START and L3_MAINT_END, and ignore the lower 6 bits cacheline offset. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- arch/arm/mm/Kconfig | 10 +++ arch/arm/mm/Makefile | 1 + arch/arm/mm/cache-kunpeng-l3.c | 153 +++++++++++++++++++++++++++++++++ arch/arm/mm/cache-kunpeng-l3.h | 30 +++++++ 4 files changed, 194 insertions(+) create mode 100644 arch/arm/mm/cache-kunpeng-l3.c create mode 100644 arch/arm/mm/cache-kunpeng-l3.h