
[v2] ARM: l2x0: make background cache ops optional for clean and flush range

Message ID 1347890398-22088-1-git-send-email-robherring2@gmail.com (mailing list archive)
State New, archived

Commit Message

Rob Herring Sept. 17, 2012, 1:59 p.m. UTC
From: Rob Herring <rob.herring@calxeda.com>

All but background ops are atomic on the pl310, so a spinlock is not
needed for a cache sync if background operations are not used. Using
background ops was an optimization for flushing large buffers, but that's
not needed on platforms where I/O is coherent and/or whose cache is larger
than the buffers likely to be flushed at once. The cache sync spinlock is
taken on every readl/writel and can be a bottleneck for code paths with
heavy register access.

The default behaviour is unchanged. Platforms can enable atomic-only
cache ops by adding "arm,use-atomic-ops" to the pl310 device-tree
node.

It is assumed that remaining background ops are only used in non-SMP
code paths.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
---
 Documentation/devicetree/bindings/arm/l2cc.txt |    3 +++
 arch/arm/mm/cache-l2x0.c                       |   18 +++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

Comments

Russell King - ARM Linux Sept. 17, 2012, 7:47 p.m. UTC | #1
On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
> 
> All but background ops are atomic on the pl310, so a spinlock is not
> needed for a cache sync if background operations are not used. Using
> background ops was an optimization for flushing large buffers, but that's
> not needed for platforms where i/o is coherent and/or that have a larger
> cache size than likely to flush at once. The cache sync spinlock is
> taken on every readl/writel and can be a bottleneck for code paths with
> register accesses.
> 
> The default behaviour is unchanged. Platforms can enable using atomic
> cache ops only by adding "arm,use-atomic-ops" to pl310 device-tree
> node.
> 
> It is assumed that remaining background ops are only used in non-SMP
> code paths.

(a) you're not describing hardware, you're describing a feature of the
    Linux kernel's implementation - no, you're describing a configuration
    feature of the kernel.  This should not be in DT.
(b) you're disabling the optimization to avoid doing a lengthy line-by-line
    cache operation when the size is larger than the cache size, and trading
    it for possibly slightly quicker accesses.

The problem I have with this is it's likely that you've only looked at
"oh, this makes IO accesses faster" and not the total picture wrt the
effect on time taken by the DMA API.

Plus, because there's little justification behind this patch (apart
from a vague reference to the speed of IO accesses) I'm going to say
NAK to this until further information is forthcoming justifying this
change.

In any case (a) applies whatever.  This is not hardware description,
this is implementation configuration which has nothing to do with DT.
Don't use DT as a dumping ground for implementation configuration.
Rob Herring Sept. 17, 2012, 8:29 p.m. UTC | #2
On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
> On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> All but background ops are atomic on the pl310, so a spinlock is not
>> needed for a cache sync if background operations are not used. Using
>> background ops was an optimization for flushing large buffers, but that's
>> not needed for platforms where i/o is coherent and/or that have a larger
>> cache size than likely to flush at once. The cache sync spinlock is
>> taken on every readl/writel and can be a bottleneck for code paths with
>> register accesses.
>>
>> The default behaviour is unchanged. Platforms can enable using atomic
>> cache ops only by adding "arm,use-atomic-ops" to pl310 device-tree
>> node.
>>
>> It is assumed that remaining background ops are only used in non-SMP
>> code paths.
> 
> (a) you're not describing hardware, you're describing a feature of the
>     Linux kernel's implementation - no, you're describing a configuration
>     feature of the kernel.  This should not be in DT.
> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
>     cache operation when the size is larger than the cache size, and trading
>     it for possibly slightly quicker accesses.
> 
> The problem I have with this is it's likely that you've only looked at
> "oh, this makes IO accesses faster" and not the total picture wrt the
> effect on time taken by the DMA API.

No, I'm looking at overall system performance, and profiling led me here.
If I/O is coherent, then L2 cache operations are never called by the DMA
mapping API. So we have an optimization using background ops that I will
never hit, and that optimization gives me a negative performance impact.

I could easily "fix" this in the xgmac driver by using __raw_writel or
writel_relaxed, but that would break non-coherent operation which does
need the L2 cache barriers.

Even with non-coherent I/O, I'm not likely to be doing >4MB at-a-time
flushes on my system. It is certainly impossible for networking. I'm not
sure about SATA, but I'd guess that size is also unlikely.

> Plus, because there's little justification behind this patch (apart
> from a vague reference to the speed of IO accesses) I'm going to say
> NAK to this until further information is forthcoming justifying this
> change.
> 

30% faster network transmit rate with pktgen on highbank.

> In any case (a) applies whatever.  This is not hardware description,
> this is implementation configuration which has nothing to do with DT.
> Don't use DT as a dumping ground for implementation configuration.

It's not without precedent, but you are right that it is a questionable
use of DT. I'm open to suggestions of how you would expose this
configuration to platforms. We obviously don't want a compile time
option. Some options:

- Add a flags param to l2x0_of_init to configure it
- Add non-background version of functions and override the function ptrs
in platform code.
- Add a new call to configure l2 settings like this.
- Expose the use_background_ops variable directly

Rob
Catalin Marinas Sept. 17, 2012, 8:45 p.m. UTC | #3
On 17 September 2012 21:29, Rob Herring <robherring2@gmail.com> wrote:
> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>> On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
>>> From: Rob Herring <rob.herring@calxeda.com>
>>>
>>> All but background ops are atomic on the pl310, so a spinlock is not
>>> needed for a cache sync if background operations are not used. Using
>>> background ops was an optimization for flushing large buffers, but that's
>>> not needed for platforms where i/o is coherent and/or that have a larger
>>> cache size than likely to flush at once. The cache sync spinlock is
>>> taken on every readl/writel and can be a bottleneck for code paths with
>>> register accesses.
>>>
>>> The default behaviour is unchanged. Platforms can enable using atomic
>>> cache ops only by adding "arm,use-atomic-ops" to pl310 device-tree
>>> node.
>>>
>>> It is assumed that remaining background ops are only used in non-SMP
>>> code paths.
>>
>> (a) you're not describing hardware, you're describing a feature of the
>>     Linux kernel's implementation - no, you're describing a configuration
>>     feature of the kernel.  This should not be in DT.
>> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
>>     cache operation when the size is larger than the cache size, and trading
>>     it for possibly slightly quicker accesses.
>>
>> The problem I have with this is it's likely that you've only looked at
>> "oh, this makes IO accesses faster" and not the total picture wrt the
>> effect on time taken by the DMA API.
>
> No, I'm looking at the system performance and profiling led me here. If
> i/o is coherent, then L2 cache operations are never called by the DMA
> mapping API. So we have an optimization using background ops that I will
> never hit and that optimization gives me a negative performance impact.
>
> I could easily "fix" this in the xgmac driver by using __raw_writel or
> writel_relaxed, but that would break non-coherent operation which does
> need the L2 cache barriers.

I haven't looked at the xgmac driver but you can use the relaxed
accessors entirely in this file with explicit mb() where the barrier
is needed. It may give you some improvement, though you still issue an
L2 sync as part of mb(). The only issue is that the relaxed accessors
are not defined on all architectures, nor are the semantics consistent
among those that do define them.

BTW, if you use the streaming DMA API, the barrier should be (or can
be added) part of the API implementation. For the coherent API,
ideally we should implement the ordering between I/O and DMA
operations as part of the API, but there isn't such functionality
provided (nor are driver writers willing to do this explicitly), so we
overload the mb() and I/O accessors.
Rob Herring Sept. 17, 2012, 9:36 p.m. UTC | #4
On 09/17/2012 03:45 PM, Catalin Marinas wrote:
> On 17 September 2012 21:29, Rob Herring <robherring2@gmail.com> wrote:
>> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>>> On Mon, Sep 17, 2012 at 08:59:58AM -0500, Rob Herring wrote:
>>>> From: Rob Herring <rob.herring@calxeda.com>
>>>>
>>>> All but background ops are atomic on the pl310, so a spinlock is not
>>>> needed for a cache sync if background operations are not used. Using
>>>> background ops was an optimization for flushing large buffers, but that's
>>>> not needed for platforms where i/o is coherent and/or that have a larger
>>>> cache size than likely to flush at once. The cache sync spinlock is
>>>> taken on every readl/writel and can be a bottleneck for code paths with
>>>> register accesses.
>>>>
>>>> The default behaviour is unchanged. Platforms can enable using atomic
>>>> cache ops only by adding "arm,use-atomic-ops" to pl310 device-tree
>>>> node.
>>>>
>>>> It is assumed that remaining background ops are only used in non-SMP
>>>> code paths.
>>>
>>> (a) you're not describing hardware, you're describing a feature of the
>>>     Linux kernel's implementation - no, you're describing a configuration
>>>     feature of the kernel.  This should not be in DT.
>>> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
>>>     cache operation when the size is larger than the cache size, and trading
>>>     it for possibly slightly quicker accesses.
>>>
>>> The problem I have with this is it's likely that you've only looked at
>>> "oh, this makes IO accesses faster" and not the total picture wrt the
>>> effect on time taken by the DMA API.
>>
>> No, I'm looking at the system performance and profiling led me here. If
>> i/o is coherent, then L2 cache operations are never called by the DMA
>> mapping API. So we have an optimization using background ops that I will
>> never hit and that optimization gives me a negative performance impact.
>>
>> I could easily "fix" this in the xgmac driver by using __raw_writel or
>> writel_relaxed, but that would break non-coherent operation which does
>> need the L2 cache barriers.
> 
> I haven't looked at the xgmac driver but you can use the relaxed
> accessors entirely in this file with explicit mb() where the barrier
> is needed. It may give you some improvement, though you still issue an
> L2 sync as part of mb(). The only issue is that the relaxed accessors
> are not defined on all architectures, nor are the semantics clear
> among those that define them.

Right, as I said I could do that, but it leaves all the issues you
mention and would reduce compile coverage (restricting it to ARM).

Also, I don't think the driver should be aware of whether it is coherent
or not. It would have to be in order to make the barriers conditional. The number of
barriers is already at the minimum I need, so the relaxed variants don't
help unless I break non-coherent mode.

> BTW, if you use the streaming DMA API, the barrier should be (or can
> be added) part of the API implementation. For the coherent API,
> ideally we should implement the ordering between I/O and DMA
> operations as part of the API but there isn't such functionality
> provided (nor driver writers willing to do this explicitly), so we
> overload the mb() and I/O accessors.

It's not really a barrier in the DMA API that is needed. It is the
buffer descriptor ring setup and the register write that triggers
polling of the descriptor ring that need ordering. Within the
descriptor ring setup, I need to ensure
the 1st descriptor is marked ready after all the other descriptors that
follow it in case the DMA is active. Then I need an additional barrier
to ensure the 1st descriptor is in memory when I set the DMA poll bit in
the xgmac register.

Rob
Russell King - ARM Linux Sept. 17, 2012, 9:47 p.m. UTC | #5
On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
> > (a) you're not describing hardware, you're describing a feature of the
> >     Linux kernel's implementation - no, you're describing a configuration
> >     feature of the kernel.  This should not be in DT.
> > (b) you're disabling the optimization to avoid doing a lengthy line-by-line
> >     cache operation when the size is larger than the cache size, and trading
> >     it for possibly slightly quicker accesses.
> > 
> > The problem I have with this is it's likely that you've only looked at
> > "oh, this makes IO accesses faster" and not the total picture wrt the
> > effect on time taken by the DMA API.
> 
> I could easily "fix" this in the xgmac driver by using __raw_writel or
> writel_relaxed, but that would break non-coherent operation which does
> need the L2 cache barriers.

Note that in many cases, drivers do not need a barrier before every write
or after every read operation.

In terms of L2 coherency, the only time that L2 actually needs to be
touched is when your MMIO access provokes the device to read from memory.
Most accesses do not.

The solution - using the relaxed MMIO accessors, not fiddling around
with the L2 cache code.

However, there's an in-built problem here - which I'm told of by PowerPC
folk.  Apparently, Linus refuses to accept that anyone needs relaxed MMIO
accessors and essentially says "all platforms must behave like x86 does".
I don't actually know Linus' real position, or what his position would be
if he knew that without relaxed accessors, we have to do a very time
consuming operation...  It might be worth broaching the subject of true
relaxed MMIO accessors again.

The only issue with that approach is getting people who don't have the
problem (iow, x86 driver authors) to understand this issue - and that's
where I absolutely agree with Linus' apparent position...

A better solution than relaxed IO accessors may be a barrier, but don't
call it that.  Maybe call it "pre_dma();" and "post_dma();" - that way
people won't get bogged down in barrier terminology or whatever.  They
just call "pre_dma();" before they write to a register which provokes
DMA, and "post_dma();" before they read from a register to get DMA
status.  What platforms do for each of those calls is up to them, and
whether platforms use relaxed MMIO accessors becomes their choice too.

Another 'but' - we have a hell of a lot of drivers... and existing
accessors must work as per x86...

> It's not without precedent, but you are right that it is a questionable
> use of DT. I'm open to suggestions of how you would expose this
> configuration to platforms. We obviously don't want a compile time
> option. Some options:
> 
> - Add a flags param to l2x0_of_init to configure it
> - Add non-background version of functions and override the function ptrs
> in platform code.
> - Add a new call to configure l2 settings like this.
> - Expose the use_background_ops variable directly

- command line setting which is available to everyone whether they're using
DT or not.
Nicolas Pitre Sept. 17, 2012, 10:31 p.m. UTC | #6
On Mon, 17 Sep 2012, Rob Herring wrote:

> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
> > Plus, because there's little justification behind this patch (apart
> > from a vague reference to the speed of IO accesses) I'm going to say
> > NAK to this until further information is forthcoming justifying this
> > change.
> > 
> 
> 30% faster network transmit rate with pktgen on highbank.

Please always include your benchmark results in the patch description. 
That is highly valuable information when looking at commit logs.


Nicolas
Rob Herring Sept. 18, 2012, 2:43 a.m. UTC | #7
On 09/17/2012 04:47 PM, Russell King - ARM Linux wrote:
> On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
>> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>>> (a) you're not describing hardware, you're describing a feature of the
>>>     Linux kernel's implementation - no, you're describing a configuration
>>>     feature of the kernel.  This should not be in DT.
>>> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
>>>     cache operation when the size is larger than the cache size, and trading
>>>     it for possibly slightly quicker accesses.
>>>
>>> The problem I have with this is it's likely that you've only looked at
>>> "oh, this makes IO accesses faster" and not the total picture wrt the
>>> effect on time taken by the DMA API.
>>
>> I could easily "fix" this in the xgmac driver by using __raw_writel or
>> writel_relaxed, but that would break non-coherent operation which does
>> need the L2 cache barriers.
> 
> Note that in many cases, drivers do not need a barrier before every write
> or after every read operation.
> 
> In terms of L2 coherency, the only time that L2 actually needs to be
> touched is when your MMIO access provokes the device to read from memory.
> Most accesses do not.

Agreed. The current approach is a bit of a sledge hammer for relatively
few places that need it.

> The solution - using the relaxed MMIO accessors, not fiddling around
> with the L2 cache code.

I have done that in critical paths as well. As I explained in my
response to Catalin, too many barriers is not my problem. The network
transmit path is a single writel and one explicit wmb already. Using
relaxed accessors will only make both wmb's explicit. I still need them
at least for the L1 in the coherent case. Even if I'm coherent, I still
have to ensure the order in which the DMA descriptors are written, and
the ordering of the descriptor writes against the DMA poll register
write. But there are no universal L1-only barriers.

The sequence is like this:
Write out N DMA descriptors
barrier
Write 1st DMA descriptor's ready bit
barrier
Write DMA poll register

For non-coherent, the barrier needs to be a dsb and L2 sync. For
coherent, this only really needs to be a dsb. Removal of the spinlock is
enough to make the L2 sync a don't care.

> However, there's an in-built problem here - which I'm told of by PowerPC
> folk.  Apparently, Linus refuses to accept that anyone needs relaxed MMIO
> accessors and essentially says "all platforms must behave like x86 does".
> I don't actually know Linus' real position, or what his position would be
> if he knew that without relaxed accessors, we have to do a very time
> consuming operation...  It might be worth broaching the subject of true
> relaxed MMIO accessors again.
> 
> The only issue with that approach is getting people who don't have the
> problem (iow, x86 driver authors) to understand this issue - and that's
> where I absolutely agree with Linus' apparent position...

Also, how much of this knowledge should be in the driver? While I really
don't care if the xgmac driver works in non-coherent mode, that's
clearly not going to be acceptable. Differences between coherent and
non-coherent operation need to stay abstracted out of the drivers.

> A better solution than relaxed IO accessors may be a barrier, but don't
> call it that.  Maybe call it "pre_dma();" and "post_dma();" - that way
> people won't get bogged down in barrier terminology or whatever.  They
> just call "pre_dma();" before they write to a register which provokes
> DMA, and "post_dma();" before they read from a register to get DMA
> status.  What platforms do for each of those calls is up to them, and
> whether platforms use relaxed MMIO accessors becomes their choice too.

If pre_dma() still calls l2x0 cache sync which takes a spinlock, then
I'm back to the same problem.

These would need to be per device as well. Perhaps it could be part of
the DMA mapping API ops.

Rob

> Another 'but' - we have a hell of a lot of drivers... and existing
> accessors must work as per x86...
> 
>> It's not without precedent, but you are right that it is a questionable
>> use of DT. I'm open to suggestions of how you would expose this
>> configuration to platforms. We obviously don't want a compile time
>> option. Some options:
>>
>> - Add a flags param to l2x0_of_init to configure it
>> - Add non-background version of functions and override the function ptrs
>> in platform code.
>> - Add a new call to configure l2 settings like this.
>> - Expose the use_background_ops variable directly
> 
> - command line setting which is available to everyone whether they're using
> DT or not.
>
Catalin Marinas Sept. 18, 2012, 8:21 a.m. UTC | #8
On Tue, Sep 18, 2012 at 03:43:55AM +0100, Rob Herring wrote:
> On 09/17/2012 04:47 PM, Russell King - ARM Linux wrote:
> > On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
> >> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
> >>> (a) you're not describing hardware, you're describing a feature of the
> >>>     Linux kernel's implementation - no, you're describing a configuration
> >>>     feature of the kernel.  This should not be in DT.
> >>> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
> >>>     cache operation when the size is larger than the cache size, and trading
> >>>     it for possibly slightly quicker accesses.
> >>>
> >>> The problem I have with this is it's likely that you've only looked at
> >>> "oh, this makes IO accesses faster" and not the total picture wrt the
> >>> effect on time taken by the DMA API.
> >>
> >> I could easily "fix" this in the xgmac driver by using __raw_writel or
> >> writel_relaxed, but that would break non-coherent operation which does
> >> need the L2 cache barriers.
> > 
> > Note that in many cases, drivers do not need a barrier before every write
> > or after every read operation.
> > 
> > In terms of L2 coherency, the only time that L2 actually needs to be
> > touched is when your MMIO access provokes the device to read from memory.
> > Most accesses do not.
> 
> Agreed. The current approach is a bit of a sledge hammer for relatively
> few places that need it.
> 
> > The solution - using the relaxed MMIO accessors, not fiddling around
> > with the L2 cache code.
> 
> I have done that in critical paths as well. As I explained in my
> response to Catalin, too many barriers is not my problem. The network
> transmit path is a single writel and one explicit wmb already. Using
> relaxed accessors will only make both wmb's explicit. I still need them
> at least for the L1 in the coherent case. Even if I'm coherent, I still
> have to ensure the order the DMA descriptors are written and the order
> of DMA descriptor writes with the DMA poll register write. But there are
> not any universal L1 only barriers.
> 
> The sequence is like this:
> Write out N DMA descriptors
> barrier
> Write 1st DMA descriptor's ready bit
> barrier
> Write DMA poll register
> 
> For non-coherent, the barrier needs to be a dsb and L2 sync. For
> coherent, this only really needs to be a dsb. Removal of the spinlock is
> enough to make the L2 sync a don't care.

Is your platform coherent for all devices? You could override the
outer_sync() to be a no-op. You wouldn't need it for other operations
and DSB is enough.

> > However, there's an in-built problem here - which I'm told of by PowerPC
> > folk.  Apparently, Linus refuses to accept that anyone needs relaxed MMIO
> > accessors and essentially says "all platforms must behave like x86 does".
> > I don't actually know Linus' real position, or what his position would be
> > if he knew that without relaxed accessors, we have to do a very time
> > consuming operation...  It might be worth broaching the subject of true
> > relaxed MMIO accessors again.
> > 
> > The only issue with that approach is getting people who don't have the
> > problem (iow, x86 driver authors) to understand this issue - and that's
> > where I absolutely agree with Linus' apparent position...
> 
> Also, how much of this knowledge should be in the driver? While I really
> don't care if the xgmac driver works in non-coherent mode, that's
> clearly not going to be acceptable. Differences between coherent or
> non-coherent need to stay abstracted out of the drivers.

I agree. So you either change outer_sync() for your platform if it is
fully coherent or improve the DMA API (the latter requiring wider
acceptance, though it has more benefits).

Dropping the locks in cache-l2x0.c is not the right solution. It seems
that you don't actually need the L2 cache sync for your device at all.
Rob Herring Sept. 18, 2012, noon UTC | #9
On 09/18/2012 03:21 AM, Catalin Marinas wrote:
> On Tue, Sep 18, 2012 at 03:43:55AM +0100, Rob Herring wrote:
>> On 09/17/2012 04:47 PM, Russell King - ARM Linux wrote:
>>> On Mon, Sep 17, 2012 at 03:29:14PM -0500, Rob Herring wrote:
>>>> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>>>>> (a) you're not describing hardware, you're describing a feature of the
>>>>>     Linux kernel's implementation - no, you're describing a configuration
>>>>>     feature of the kernel.  This should not be in DT.
>>>>> (b) you're disabling the optimization to avoid doing a lengthy line-by-line
>>>>>     cache operation when the size is larger than the cache size, and trading
>>>>>     it for possibly slightly quicker accesses.
>>>>>
>>>>> The problem I have with this is it's likely that you've only looked at
>>>>> "oh, this makes IO accesses faster" and not the total picture wrt the
>>>>> effect on time taken by the DMA API.
>>>>
>>>> I could easily "fix" this in the xgmac driver by using __raw_writel or
>>>> writel_relaxed, but that would break non-coherent operation which does
>>>> need the L2 cache barriers.
>>>
>>> Note that in many cases, drivers do not need a barrier before every write
>>> or after every read operation.
>>>
>>> In terms of L2 coherency, the only time that L2 actually needs to be
>>> touched is when your MMIO access provokes the device to read from memory.
>>> Most accesses do not.
>>
>> Agreed. The current approach is a bit of a sledge hammer for relatively
>> few places that need it.
>>
>>> The solution - using the relaxed MMIO accessors, not fiddling around
>>> with the L2 cache code.
>>
>> I have done that in critical paths as well. As I explained in my
>> response to Catalin, too many barriers is not my problem. The network
>> transmit path is a single writel and one explicit wmb already. Using
>> relaxed accessors will only make both wmb's explicit. I still need them
>> at least for the L1 in the coherent case. Even if I'm coherent, I still
>> have to ensure the order the DMA descriptors are written and the order
>> of DMA descriptor writes with the DMA poll register write. But there are
>> not any universal L1 only barriers.
>>
>> The sequence is like this:
>> Write out N DMA descriptors
>> barrier
>> Write 1st DMA descriptor's ready bit
>> barrier
>> Write DMA poll register
>>
>> For non-coherent, the barrier needs to be a dsb and L2 sync. For
>> coherent, this only really needs to be a dsb. Removal of the spinlock is
>> enough to make the L2 sync a don't care.
> 
> Is your platform coherent for all devices? You could override the
> outer_sync() to be a no-op. You wouldn't need it for other operations
> and DSB is enough.

No, coherency is per-device, and PCI is never coherent.

>>> However, there's an in-built problem here - which I'm told of by PowerPC
>>> folk.  Apparently, Linus refuses to accept that anyone needs relaxed MMIO
>>> accessors and essentially says "all platforms must behave like x86 does".
>>> I don't actually know Linus' real position, or what his position would be
>>> if he knew that without relaxed accessors, we have to do a very time
>>> consuming operation...  It might be worth broaching the subject of true
>>> relaxed MMIO accessors again.
>>>
>>> The only issue with that approach is getting people who don't have the
>>> problem (iow, x86 driver authors) to understand this issue - and that's
>>> where I absolutely agree with Linus' apparent position...
>>
>> Also, how much of this knowledge should be in the driver? While I really
>> don't care if the xgmac driver works in non-coherent mode, that's
>> clearly not going to be acceptable. Differences between coherent or
>> non-coherent need to stay abstracted out of the drivers.
> 
> I agree. So you either change outer_sync() for your platform if it is
> fully coherent or improve the DMA API (the latter requiring wider
> acceptance, though it has more benefits).
> 
> Dropping the locks in cache-l2x0.c is not the right solution. It seems
> that you don't actually need the L2 cache sync for your device at all.

You had the same solution of removing all the spinlocks before the
background flushes got added to the range operations.

It is quite sad that a register access on Cortex-A9 costs a spinlock and
an extra register write in addition to the actual access. Why do I have
to pay that penalty when I'll never do >4MB buffer flushes?

Rob
Rob Herring Sept. 18, 2012, 12:50 p.m. UTC | #10
On 09/17/2012 05:31 PM, Nicolas Pitre wrote:
> On Mon, 17 Sep 2012, Rob Herring wrote:
> 
>> On 09/17/2012 02:47 PM, Russell King - ARM Linux wrote:
>>> Plus, because there's little justification behind this patch (apart
>>> from a vague reference to the speed of IO accesses) I'm going to say
>>> NAK to this until further information is forthcoming justifying this
>>> change.
>>>
>>
>> 30% faster network transmit rate with pktgen on highbank.
> 
> Please always include your benchmark results in the patch description. 
> That is highly valuable information when looking at commit logs.
> 

Agreed. It would likewise have been helpful if the original optimization
to use background ops had included some data as well:

    ARM: l2x0: Optimise the range based operations

    For the big buffers which are in excess of cache size, the maintaince
    operations by PA are very slow. For such buffers the maintainace
    operations can be speeded up by using the WAY based method.

    Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Linus Walleij <linus.walleij@stericsson.com>

Rob

Patch

diff --git a/Documentation/devicetree/bindings/arm/l2cc.txt b/Documentation/devicetree/bindings/arm/l2cc.txt
index 7ca5216..907c066 100644
--- a/Documentation/devicetree/bindings/arm/l2cc.txt
+++ b/Documentation/devicetree/bindings/arm/l2cc.txt
@@ -29,6 +29,9 @@  Optional properties:
   filter. Addresses in the filter window are directed to the M1 port. Other
   addresses will go to the M0 port.
 - interrupts : 1 combined interrupt.
+- arm,use-atomic-ops : If present only use atomic cache flush operations and
+  don't use background operations except for non-SMP safe locations (boot and
+  shutdown).
 
 Example:
 
diff --git a/arch/arm/mm/cache-l2x0.c b/arch/arm/mm/cache-l2x0.c
index 2a8e380..e3b2ac2 100644
--- a/arch/arm/mm/cache-l2x0.c
+++ b/arch/arm/mm/cache-l2x0.c
@@ -33,6 +33,7 @@  static DEFINE_RAW_SPINLOCK(l2x0_lock);
 static u32 l2x0_way_mask;	/* Bitmask of active ways */
 static u32 l2x0_size;
 static unsigned long sync_reg_offset = L2X0_CACHE_SYNC;
+static bool use_background_ops = true;
 
 struct l2x0_regs l2x0_saved_regs;
 
@@ -130,6 +131,11 @@  static void l2x0_cache_sync(void)
 	raw_spin_unlock_irqrestore(&l2x0_lock, flags);
 }
 
+static void l2x0_cache_sync_nolock(void)
+{
+	cache_sync();
+}
+
 static void __l2x0_flush_all(void)
 {
 	debug_writel(0x03);
@@ -219,7 +225,7 @@  static void l2x0_clean_range(unsigned long start, unsigned long end)
 	void __iomem *base = l2x0_base;
 	unsigned long flags;
 
-	if ((end - start) >= l2x0_size) {
+	if (use_background_ops && ((end - start) >= l2x0_size)) {
 		l2x0_clean_all();
 		return;
 	}
@@ -249,7 +255,7 @@  static void l2x0_flush_range(unsigned long start, unsigned long end)
 	void __iomem *base = l2x0_base;
 	unsigned long flags;
 
-	if ((end - start) >= l2x0_size) {
+	if (use_background_ops && ((end - start) >= l2x0_size)) {
 		l2x0_flush_all();
 		return;
 	}
@@ -379,7 +385,8 @@  void __init l2x0_init(void __iomem *base, u32 aux_val, u32 aux_mask)
 	outer_cache.inv_range = l2x0_inv_range;
 	outer_cache.clean_range = l2x0_clean_range;
 	outer_cache.flush_range = l2x0_flush_range;
-	outer_cache.sync = l2x0_cache_sync;
+	if (!outer_cache.sync)
+		outer_cache.sync = l2x0_cache_sync;
 	outer_cache.flush_all = l2x0_flush_all;
 	outer_cache.inv_all = l2x0_inv_all;
 	outer_cache.disable = l2x0_disable;
@@ -456,6 +463,11 @@  static void __init pl310_of_setup(const struct device_node *np,
 		writel_relaxed((filter[0] & ~(SZ_1M - 1)) | L2X0_ADDR_FILTER_EN,
 			       l2x0_base + L2X0_ADDR_FILTER_START);
 	}
+
+	if (of_property_read_bool(np, "arm,use-atomic-ops")) {
+		use_background_ops = false;
+		outer_cache.sync = l2x0_cache_sync_nolock;
+	}
 }
 
 static void __init pl310_save(void)