| Message ID | 1347890398-22088-1-git-send-email-robherring2@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> All but background ops are atomic on the pl310, so a spinlock is not
> needed for a cache sync if background operations are not used. Using
> background ops was an optimization for flushing large buffers, but that's
> not needed for platforms where i/o is coherent and/or that have a larger
> cache size than likely to flush at once. The cache sync spinlock is
> taken on every readl/writel and can be a bottleneck for code paths with
> register accesses.
>
> The default behaviour is unchanged. Platforms can enable using atomic
> cache ops only by adding "arm,use-atomic-ops" to pl310 device-tree
> node.
>
> It is assumed that remaining background ops are only used in non-SMP
> code paths.

(a) you're not describing hardware, you're describing a feature of the
Linux kernel's implementation - no, you're describing a configuration
feature of the kernel. This should not be in DT.
(b) you're disabling the optimization to avoid doing a lengthy line-by-line
cache operation when the size is larger than the cache size, and trading
it for possibly slightly quicker accesses.

The problem I have with this is it's likely that you've only looked at
"oh, this makes IO accesses faster" and not the total picture wrt the
effect on time taken by the DMA API.

Plus, because there's little justification behind this patch (apart
from a vague reference to the speed of IO accesses) I'm going to say
NAK to this until further information is forthcoming justifying this
change.

In any case (a) applies whatever. This is not hardware description,
this is implementation configuration which has nothing to do with DT.
Don't use DT as a dumping ground for implementation configuration.
On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
> On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
>> [...]
>
> (a) you're not describing hardware, you're describing a feature of the
> Linux kernel's implementation - no, you're describing a configuration
> feature of the kernel. This should not be in DT.
> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
> cache operation when the size is larger than the cache size, and trading
> it for possibly slightly quicker accesses.
>
> The problem I have with this is it's likely that you've only looked at
> "oh, this makes IO accesses faster" and not the total picture wrt the
> effect on time taken by the DMA API.

No, I'm looking at the system performance, and profiling led me here. If
i/o is coherent, then L2 cache operations are never called by the DMA
mapping API. So we have an optimization using background ops that I will
never hit, and that optimization gives me a negative performance impact.

I could easily "fix" this in the xgmac driver by using __raw_writel or
writel_relaxed, but that would break non-coherent operation, which does
need the L2 cache barriers.

Even with non-coherent i/o, I'm not likely to be doing >4MB at a time
flushes on my system. It is certainly impossible for networking. I'm not
sure about SATA, but I'd guess that size is also unlikely.

> Plus, because there's little justification behind this patch (apart
> from a vague reference to the speed of IO accesses) I'm going to say
> NAK to this until further information is forthcoming justifying this
> change.

30% faster network transmit rate with pktgen on highbank.

> In any case (a) applies whatever. This is not hardware description,
> this is implementation configuration which has nothing to do with DT.
> Don't use DT as a dumping ground for implementation configuration.

It's not without precedent, but you are right that it is a questionable
use of DT. I'm open to suggestions of how you would expose this
configuration to platforms. We obviously don't want a compile time
option. Some options:

- Add a flags param to l2x0_of_init to configure it
- Add non-background versions of the functions and override the function
  ptrs in platform code.
- Add a new call to configure l2 settings like this.
- Expose the use_background_ops variable directly

Rob
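The first option Rob lists could be sketched as below. This is a hypothetical illustration only: the flag name L2X0_FLAG_NO_BG_OPS and the three-argument init signature are invented (the real l2x0_of_init() takes just aux_val/aux_mask), and the kernel code is modelled as plain user-space C so the selection logic can be exercised.

```c
/* Hypothetical sketch: pass a flags parameter into the l2x0 init path
 * instead of reading a DT property.  Flag name and signature invented. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define L2X0_FLAG_NO_BG_OPS (1u << 0)   /* invented flag */

static bool use_background_ops = true;

/* stand-in for a hypothetical l2x0_of_init(aux_val, aux_mask, flags) */
void l2x0_of_init_sim(uint32_t aux_val, uint32_t aux_mask, uint32_t flags)
{
    (void)aux_val;
    (void)aux_mask;
    if (flags & L2X0_FLAG_NO_BG_OPS)
        use_background_ops = false;  /* atomic ops only: sync needs no lock */
}

/* a range op decides between background (whole-cache) and line-by-line */
bool clean_range_uses_background(uint32_t len, uint32_t cache_size)
{
    return use_background_ops && len >= cache_size;
}
```

The platform (highbank here) would pass the flag from its own init code, keeping the policy out of DT, which addresses Russell's objection (a).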
On 17 September 2012 21:29, Rob Herring <robherring2@gmail.com> wrote:
> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>> [...]
>
> No, I'm looking at the system performance and profiling led me here. If
> i/o is coherent, then L2 cache operations are never called by the DMA
> mapping API. So we have an optimization using background ops that I will
> never hit and that optimization gives me a negative performance impact.
>
> I could easily "fix" this in the xgmac driver by using __raw_writel or
> writel_relaxed, but that would break non-coherent operation which does
> need the L2 cache barriers.

I haven't looked at the xgmac driver, but you can use the relaxed
accessors entirely in this file with an explicit mb() where the barrier
is needed. It may give you some improvement, though you still issue an
L2 sync as part of mb(). The only issue is that the relaxed accessors
are not defined on all architectures, nor are the semantics clear
between those that do define them.

BTW, if you use the streaming DMA API, the barrier should be (or can be
added as) part of the API implementation. For the coherent API, ideally
we should implement the ordering between I/O and DMA operations as part
of the API, but there isn't such functionality provided (nor are driver
writers willing to do this explicitly), so we overload mb() and the I/O
accessors.
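The pattern Catalin suggests — relaxed accessors for every register poke, one explicit mb() only before the access that actually provokes DMA — can be modelled in user space. The accessor, barrier, and register names below are simulations (an event log instead of real MMIO) invented to make the ordering observable; in a driver these would be writel_relaxed()/mb() on an ioremap'd base.

```c
/* User-space model of "relaxed accessors + one explicit mb()".
 * Register names are invented; events are logged so order can be checked. */
#include <assert.h>
#include <string.h>

#define LOG_MAX 16
static const char *event_log[LOG_MAX];
static int event_count;

/* stand-in for writel_relaxed(): no barrier implied */
static void writel_relaxed_sim(unsigned val, const char *reg)
{
    (void)val;
    event_log[event_count++] = reg;
}

/* stand-in for mb(): in the kernel this is dsb + outer_sync */
static void mb_sim(void)
{
    event_log[event_count++] = "mb";
}

int xgmac_poll_demand_sim(void)
{
    writel_relaxed_sim(0x1, "irq_mask");   /* plain pokes: no barrier needed */
    writel_relaxed_sim(0x0, "irq_status");
    mb_sim();                              /* descriptors visible first... */
    writel_relaxed_sim(0x1, "dma_poll");   /* ...then the DMA-triggering write */
    return event_count;
}
```

Note this still pays for one L2 sync inside mb(), which is exactly the residual cost Catalin points out.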
On 09/17/2012 03:45 PM, Catalin Marinas wrote:
> On 17 September 2012 21:29, Rob Herring <robherring2@gmail.com> wrote:
>> [...]
>>
>> I could easily "fix" this in the xgmac driver by using __raw_writel or
>> writel_relaxed, but that would break non-coherent operation which does
>> need the L2 cache barriers.
>
> I haven't looked at the xgmac driver but you can use the relaxed
> accessors entirely in this file with explicit mb() where the barrier
> is needed. It may give you some improvement, though you still issue an
> L2 sync as part of mb(). The only issue is that the relaxed accessors
> are not defined on all architectures, nor the semantics are clear
> between those that define it.

Right, as I said I could do that, but it leaves all the issues you
mention and would reduce compile coverage (restricting it to ARM). Also,
I don't think the driver should be aware of whether it is coherent or
not, but it would have to be to make the barriers conditional. The
number of barriers is already at the minimum I need, so the relaxed
variants don't help unless I break non-coherent mode.

> BTW, if you use the streaming DMA API, the barrier should be (or can
> be added) part of the API implementation. For the coherent API,
> ideally we should implement the ordering between I/O and DMA
> operations as part of the API but there isn't such functionality
> provided (nor driver writers willing to do this explicitly), so we
> overload the mb() and I/O accessors.

It's not really a barrier within the DMA API that is needed. It is the
buffer descriptor ring setup and the register write to go poll the
descriptor ring that need ordering. Within the descriptor ring setup, I
need to ensure the 1st descriptor is marked ready after all the other
descriptors that follow it, in case the DMA is active. Then I need an
additional barrier to ensure the 1st descriptor is in memory when I set
the DMA poll bit in the xgmac register.

Rob
On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>> [...]
>
> I could easily "fix" this in the xgmac driver by using __raw_writel or
> writel_relaxed, but that would break non-coherent operation which does
> need the L2 cache barriers.

Note that in many cases, drivers do not need a barrier before every
write or after every read operation.

In terms of L2 coherency, the only time that L2 actually needs to be
touched is when your MMIO access provokes the device to read from
memory. Most accesses do not.

The solution - use the relaxed MMIO accessors, not fiddling around with
the L2 cache code.

However, there's an in-built problem here - which I'm told of by PowerPC
folk. Apparently, Linus refuses to accept that anyone needs relaxed MMIO
accessors and essentially says "all platforms must behave like x86
does". I don't actually know Linus' real position, or what his position
would be if he knew that without relaxed accessors, we have to do a very
time consuming operation... It might be worth broaching the subject of
true relaxed MMIO accessors again.

The only issue with that approach is getting people who don't have the
problem (iow, x86 driver authors) to understand this issue - and that's
where I absolutely agree with Linus' apparent position...

A better solution than relaxed IO accessors may be a barrier, but don't
call it that. Maybe call it "pre_dma();" and "post_dma();" - that way
people won't get bogged down in barrier terminology or whatever. They
just call "pre_dma();" before they write to a register which provokes
DMA, and "post_dma();" before they read from a register to get DMA
status. What platforms do for each of those calls is up to them, and
whether platforms use relaxed MMIO accessors becomes their choice too.

Another but - we have a hell of a lot of drivers... and existing
accessors must work as per x86...

> It's not without precedence, but you are right that it is a questionable
> use of DT. I'm open to suggestions of how you would expose this
> configuration to platforms. We obviously don't want a compile time
> option. Some options:
>
> - Add a flags param to l2x0_of_init to configure it
> - Add non-background version of functions and override the function ptrs
>   in platform code.
> - Add a new call to configure l2 settings like this.
> - Expose the use_background_ops variable directly

- command line setting which is available to everyone whether they're
  using DT or not.
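Russell's pre_dma()/post_dma() idea could look roughly like the sketch below. Only the two hook names come from his mail; the ops struct, the counter, and the per-platform implementations are invented here to show how the coherent and non-coherent cases diverge without the driver knowing which it is on.

```c
/* Sketch of Russell's proposed pre_dma()/post_dma() hooks, modelled in
 * user space.  The ops struct and counter are invented; only the hook
 * names come from the mail.  A platform picks one ops table at init. */
#include <assert.h>

struct dma_barrier_ops {
    void (*pre_dma)(void);   /* before a write that provokes DMA */
    void (*post_dma)(void);  /* before reading DMA status */
};

static int outer_syncs;      /* counts simulated L2 sync operations */

static void noop(void) { }
static void dsb_and_outer_sync(void) { outer_syncs++; }

/* non-coherent platform: hooks must push data past the outer cache */
static const struct dma_barrier_ops noncoherent_ops = {
    .pre_dma  = dsb_and_outer_sync,
    .post_dma = dsb_and_outer_sync,
};

/* fully coherent platform: a dsb alone suffices, modelled as a no-op */
static const struct dma_barrier_ops coherent_ops = {
    .pre_dma  = noop,
    .post_dma = noop,
};

/* a driver transmit path; it never asks whether the platform is coherent */
int kick_dma(const struct dma_barrier_ops *ops)
{
    int before = outer_syncs;
    ops->pre_dma();              /* descriptors visible to the device */
    /* writel(1, base + DMA_POLL_REG); would go here */
    return outer_syncs - before; /* L2 syncs this transfer cost */
}
```

As Rob points out in his reply, this only helps if the coherent platform's hooks really avoid the l2x0 spinlock path; the sketch makes that visible as a sync count of zero.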
On Mon, 17 Sep 2012, Rob Herring wrote:
> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>> Plus, because there's little justification behind this patch (apart
>> from a vague reference to the speed of IO accesses) I'm going to say
>> NAK to this until further information is forthcoming justifying this
>> change.
>
> 30% faster network transmit rate with pktgen on highbank.

Please always include your benchmark results in the patch description.
That is highly valuable information when looking at commit logs.

Nicolas
On 09/17/2012 04:47 PM, Russell King - ARM Linux wrote:
> On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
>> [...]
>
> Note that in many cases, drivers do not need a barrier before every write
> or after every read operation.
>
> In terms of L2 coherency, the only time that L2 actually needs to be
> touched is when your MMIO access provokes the device to read from memory.
> Most accesses do not.

Agreed. The current approach is a bit of a sledgehammer for the
relatively few places that need it.

> The solution - using the relaxed MMIO accessors, not fiddling around
> with the L2 cache code.

I have done that in critical paths as well. As I explained in my
response to Catalin, too many barriers is not my problem. The network
transmit path is already a single writel and one explicit wmb. Using
relaxed accessors will only make both wmb's explicit. I still need them
at least for the L1 in the coherent case. Even if I'm coherent, I still
have to ensure the order the DMA descriptors are written in, and the
order of the DMA descriptor writes relative to the DMA poll register
write. But there are no universal L1-only barriers.

The sequence is like this:

  Write out N DMA descriptors
  barrier
  Write 1st DMA descriptor's ready bit
  barrier
  Write DMA poll register

For non-coherent, the barrier needs to be a dsb and an L2 sync. For
coherent, this only really needs to be a dsb. Removal of the spinlock is
enough to make the L2 sync a don't-care.

> However, there's an in-built problem here - which I'm told of by PowerPC
> folk. [...]
>
> The only issue with that approach is getting people who don't have the
> problem (iow, x86 driver authors) to understand this issue - and that's
> where I absolutely agree with Linus' apparant position...

Also, how much of this knowledge should be in the driver? While I really
don't care if the xgmac driver works in non-coherent mode, that's
clearly not going to be acceptable. Differences between coherent and
non-coherent need to stay abstracted out of the drivers.

> A better solution than relaxed IO accessors maybe a barrier, but don't
> call it that. Maybe call it "pre_dma();" and "post_dma();" - that way
> people won't get bogged down in barrier terminology or whatever. They
> just call "pre_dma();" before they write to a register which provokes
> DMA, and "post_dma();" before they read from a register to get DMA
> status. What platforms do for each of those calls is up to them, and
> whether platforms use relaxed MMIO accessors becomes their choice too.

If pre_dma() still calls the l2x0 cache sync, which takes a spinlock,
then I'm back to the same problem. These would also need to be per
device. Perhaps it could be part of the DMA mapping API ops.

Rob
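The five-step transmit sequence Rob describes can be written out as a small simulation. The descriptor layout, ready bit, and register are invented stand-ins (plain memory and a C11-style fence instead of real descriptors and dsb/outer_sync), but the ordering structure — descriptors, barrier, first ready bit, barrier, poll register — is exactly the one from the mail.

```c
/* Simulation of the xgmac transmit ordering Rob describes.  The ring
 * layout, DESC_READY bit, and register are invented; a thread fence
 * stands in for the dsb (+ L2 sync in the non-coherent case). */
#include <assert.h>
#include <stdint.h>

#define NDESC      4
#define DESC_READY 0x80000000u

struct xgmac_desc_sim { uint32_t flags; uint32_t addr; };

static struct xgmac_desc_sim ring[NDESC];
static uint32_t dma_poll_reg;

static void barrier_sim(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* stands in for wmb()/dsb */
}

void xgmac_xmit_sim(uint32_t buf_addr)
{
    int i;

    /* 1. write out descriptors 1..N-1, leaving desc 0 not yet ready;
     *    if DMA is already active it must not see desc 0 ready before
     *    the descriptors that follow it are in place */
    for (i = NDESC - 1; i > 0; i--) {
        ring[i].addr  = buf_addr + (uint32_t)i * 2048u;
        ring[i].flags = DESC_READY;
    }
    ring[0].addr = buf_addr;

    barrier_sim();               /* 2. descriptors ordered before... */
    ring[0].flags = DESC_READY;  /* 3. ...the first ready bit */
    barrier_sim();               /* 4. ready bit in memory before... */
    dma_poll_reg = 1;            /* 5. ...poking the DMA poll register */
}
```

On a coherent platform only the dsb part of each barrier does work, which is why the l2x0 spinlock taken on every sync is pure overhead for this path.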
On Tue, Sep 18, 2012 at 03:43:55AM +0100, Rob Herring wrote:
> On 09/17/2012 04:47 PM, Russell King - ARM Linux wrote:
>> [...]
>
> I have done that in critical paths as well. As I explained in my
> response to Catalin, too many barriers is not my problem. [...]
>
> The sequence is like this:
>   Write out N DMA descriptors
>   barrier
>   Write 1st DMA descriptor's ready bit
>   barrier
>   Write DMA poll register
>
> For non-coherent, the barrier needs to be a dsb and L2 sync. For
> coherent, this only really needs to be a dsb. Removal of the spinlock is
> enough to make the L2 sync a don't care.

Is your platform coherent for all devices? You could override the
outer_sync() to be a no-op. You wouldn't need it for other operations
and DSB is enough.

> Also, how much of this knowledge should be in the driver? While I really
> don't care if the xgmac driver works in non-coherent mode, that's
> clearly not going to be acceptable. Differences between coherent or
> non-coherent need to stay abstracted out of the drivers.

I agree. So you either change outer_sync() for your platform if it is
fully coherent or improve the DMA API (the latter requiring wider
acceptance, though it has more benefits).

Dropping the locks in cache-l2x0.c is not the right solution. It seems
that you don't actually need the L2 cache sync for your device at all.
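Catalin's suggestion — install a no-op outer_sync() on a fully coherent platform so that mb() reduces to a dsb — can be sketched as below. The struct is modelled loosely on the kernel's outer_cache function table, but the names with a _sim suffix, the counter, and mb_sim() are invented for illustration.

```c
/* Sketch of overriding outer_sync() with a no-op on a coherent platform.
 * Modelled on the kernel's outer_cache ops table; _sim names invented. */
#include <assert.h>
#include <stddef.h>

struct outer_cache_fns_sim {
    void (*sync)(void);
};

static int l2_syncs;                       /* counts simulated L2 syncs */
static void l2x0_cache_sync_sim(void) { l2_syncs++; }
static void outer_sync_noop(void) { }      /* coherent: nothing to push out */

static struct outer_cache_fns_sim outer_cache_sim = {
    .sync = l2x0_cache_sync_sim,           /* default installed by l2x0 init */
};

/* what mb() boils down to: a dsb (implicit here) plus the outer sync */
static void mb_sim(void)
{
    if (outer_cache_sim.sync)
        outer_cache_sim.sync();
}

int barrier_cost_after_override(void)
{
    mb_sim();                               /* default path: hits the L2 sync */
    outer_cache_sim.sync = outer_sync_noop; /* platform override */
    mb_sim();                               /* now effectively dsb-only */
    return l2_syncs;
}
```

As Rob notes in his follow-up, this only works if every device on the platform is coherent — highbank's per-device coherency (and never-coherent PCI) rules the blanket override out there.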
On 09/18/2012 03:21 AM, Catalin Marinas wrote:
> On Tue, Sep 18, 2012 at 03:43:55AM +0100, Rob Herring wrote:
>> [...]
>>
>> For non-coherent, the barrier needs to be a dsb and L2 sync. For
>> coherent, this only really needs to be a dsb. Removal of the spinlock is
>> enough to make the L2 sync a don't care.
>
> Is your platform coherent for all devices? You could override the
> outer_sync() to be a no-op. You wouldn't need it for other operations
> and DSB is enough.

No, coherency is per device, and PCI is never coherent.

>> Also, how much of this knowledge should be in the driver? [...]
>> Differences between coherent or non-coherent need to stay abstracted
>> out of the drivers.
>
> I agree. So you either change outer_sync() for your platform if it is
> fully coherent or improve the DMA API (the latter requiring wider
> acceptance, though it has more benefits).
>
> Dropping the locks in cache-l2x0.c is not the right solution. It seems
> that you don't actually need the L2 cache sync for your device at all.

You had the same solution of removing all the spinlocks before the
background flushes got added to the range operations. It is quite sad
that a register access on a Cortex-A9 is a spinlock and a register write
in addition to the actual access. Why do I have to pay that penalty when
I'll never do >4MB buffer flushes?

Rob
On 09/17/2012 05:31 PM, Nicolas Pitre wrote:
> On Mon, 17 Sep 2012, Rob Herring wrote:
>> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>>> Plus, because there's little justification behind this patch (apart
>>> from a vague reference to the speed of IO accesses) I'm going to say
>>> NAK to this until further information is forthcoming justifying this
>>> change.
>>
>> 30% faster network transmit rate with pktgen on highbank.
>
> Please always include your benchmark results in the patch description.
> That is highly valuable information when looking at commit logs.

Agreed. It would also have been helpful if the original optimization to
use background ops had come with some data:

    ARM: l2x0: Optimise the range based operations

    For the big buffers which are in excess of cache size, the maintaince
    operations by PA are very slow. For such buffers the maintainace
    operations can be speeded up by using the WAY based method.

    Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Linus Walleij <linus.walleij@stericsson.com>

Rob
diff --git a/Documentation/devicetree/bindings/arm/l2cc.txt b/Documentation/devicetree/bindings/arm/l2cc.txt
index 7ca5216..907c066 100644
--- a/Documentation/devicetree/bindings/arm/l2cc.txt
+++ b/Documentation/devicetree/bindings/arm/l2cc.txt
@@ -29,6 +29,9 @@ Optional properties:
   filter. Addresses in the filter window are directed to the M1 port.
   Other addresses will go to the M0 port.
 - interrupts : 1 combined interrupt.
+- arm,use-atomic-ops : If present only use atomic cache flush operations and
+  don't use background operations except for non-SMP safe locations (boot and
+  shutdown).
 
 Example:
diff --git a/arch/arm/mm/cache-l2x0.c b/arch/arm/mm/cache-l2x0.c
index 2a8e380..e3b2ac2 100644
--- a/arch/arm/mm/cache-l2x0.c
+++ b/arch/arm/mm/cache-l2x0.c
@@ -33,6 +33,7 @@ static DEFINE_RAW_SPINLOCK(l2x0_lock);
 static u32 l2x0_way_mask;	/* Bitmask of active ways */
 static u32 l2x0_size;
 static unsigned long sync_reg_offset = L2X0_CACHE_SYNC;
+static bool use_background_ops = true;
 
 struct l2x0_regs l2x0_saved_regs;
@@ -130,6 +131,11 @@ static void l2x0_cache_sync(void)
 	raw_spin_unlock_irqrestore(&l2x0_lock, flags);
 }
 
+static void l2x0_cache_sync_nolock(void)
+{
+	cache_sync();
+}
+
 static void __l2x0_flush_all(void)
 {
 	debug_writel(0x03);
@@ -219,7 +225,7 @@ static void l2x0_clean_range(unsigned long start, unsigned long end)
 	void __iomem *base = l2x0_base;
 	unsigned long flags;
 
-	if ((end - start) >= l2x0_size) {
+	if (use_background_ops && ((end - start) >= l2x0_size)) {
 		l2x0_clean_all();
 		return;
 	}
@@ -249,7 +255,7 @@ static void l2x0_flush_range(unsigned long start, unsigned long end)
 	void __iomem *base = l2x0_base;
 	unsigned long flags;
 
-	if ((end - start) >= l2x0_size) {
+	if (use_background_ops && ((end - start) >= l2x0_size)) {
 		l2x0_flush_all();
 		return;
 	}
@@ -379,7 +385,8 @@ void __init l2x0_init(void __iomem *base, u32 aux_val, u32 aux_mask)
 	outer_cache.inv_range = l2x0_inv_range;
 	outer_cache.clean_range = l2x0_clean_range;
 	outer_cache.flush_range = l2x0_flush_range;
-	outer_cache.sync = l2x0_cache_sync;
+	if (!outer_cache.sync)
+		outer_cache.sync = l2x0_cache_sync;
 	outer_cache.flush_all = l2x0_flush_all;
 	outer_cache.inv_all = l2x0_inv_all;
 	outer_cache.disable = l2x0_disable;
@@ -456,6 +463,11 @@ static void __init pl310_of_setup(const struct device_node *np,
 		writel_relaxed((filter[0] & ~(SZ_1M - 1)) | L2X0_ADDR_FILTER_EN,
 			       l2x0_base + L2X0_ADDR_FILTER_START);
 	}
+
+	if (of_property_read_bool(np, "arm,use-atomic-ops")) {
+		use_background_ops = false;
+		outer_cache.sync = l2x0_cache_sync_nolock;
+	}
 }
 
 static void __init pl310_save(void)
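For reference, a pl310 node opting in to the proposed property might look like the fragment below. The property name comes from the patch; the node label, unit address, reg value, and other properties are illustrative, loosely following the existing l2cc.txt example style.

```dts
L2: cache-controller@fff12000 {
	compatible = "arm,pl310-cache";
	reg = <0xfff12000 0x1000>;
	cache-unified;
	cache-level = <2>;
	arm,use-atomic-ops;
};
```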