diff mbox

ath10k: Replace ioread with wmb for data sync

Message ID 1422311118-11320-1-git-send-email-poh@qca.qualcomm.com (mailing list archive)
State Rejected
Headers show

Commit Message

Peter Oh Jan. 26, 2015, 10:25 p.m. UTC
Using ioread() to perform data sync is excessive.
Use compact API, wmb(), that intended to be used for the case.
It reduces total 14 CPU clocks per interrupt.

Signed-off-by: Peter Oh <poh@qca.qualcomm.com>
---
 drivers/net/wireless/ath/ath10k/pci.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

Comments

Bob Copeland Jan. 27, 2015, 9:33 p.m. UTC | #1
On Mon, Jan 26, 2015 at 02:25:18PM -0800, Peter Oh wrote:
> Using ioread() to perform data sync is excessive.
> Use compact API, wmb(), that intended to be used for the case.
> It reduces total 14 CPU clocks per interrupt.

Hi,

>  	ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
>  			   PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
>  
> -	/* IMPORTANT: this extra read transaction is required to
> -	 * flush the posted write buffer. */
> -	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
> -				PCIE_INTR_ENABLE_ADDRESS);
> +	/* invoke data sync barrier */
> +	wmb();
>  }

I am no expert in arcane PCI matters, but that looks suspicious to me.  I seem
to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
on all platforms.  If you want to disable an irq, it really seems like you
would want to flush posted writes so you know the hardware has seen it.
Peter Oh Jan. 27, 2015, 11:53 p.m. UTC | #2
On 01/27/2015 01:33 PM, Bob Copeland wrote:
> On Mon, Jan 26, 2015 at 02:25:18PM -0800, Peter Oh wrote:
>> Using ioread() to perform data sync is excessive.
>> Use compact API, wmb(), that intended to be used for the case.
>> It reduces total 14 CPU clocks per interrupt.
> Hi,
>
>>   	ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
>>   			   PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
>>   
>> -	/* IMPORTANT: this extra read transaction is required to
>> -	 * flush the posted write buffer. */
>> -	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
>> -				PCIE_INTR_ENABLE_ADDRESS);
>> +	/* invoke data sync barrier */
>> +	wmb();
>>   }
> I am no expert in arcane PCI matters, but that looks suspicious to me.  I seem
> to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
> on all platforms.  If you want to disable an irq, it really seems like you
> would want to flush posted writes so you know the hardware has seen it.
enforced ordering is happened by flush write buffer and wmb is commonly 
used to flush write buffer.
so that wmb guarantees ordering by flush write buffer. That's why it's 
called a memory barrier.
when wmb is used, it guarantees all memory accesses complete before wmb 
command completes.
for instance dsb is the corresponding command for wmb in arm and arm 
instruction guide says
"The DSB instruction completes when all explicit memory accesses before 
it complete."
Which means DSB instruction will hold the bus (usually AXI bus) and 
never returns until any memory or memory mapped I/O access complete
(dsb is for ARM and other platforms have the equivalent instruction such 
as sfence, sync, mf, and dcs).
Thanks,
Peter
Bob Copeland Jan. 28, 2015, 4:30 a.m. UTC | #3
On Tue, Jan 27, 2015 at 03:53:00PM -0800, Peter Oh wrote:
> >>-	/* IMPORTANT: this extra read transaction is required to
> >>-	 * flush the posted write buffer. */
> >>-	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
> >>-				PCIE_INTR_ENABLE_ADDRESS);
> >>+	/* invoke data sync barrier */
> >>+	wmb();
> >>  }
> >I am no expert in arcane PCI matters, but that looks suspicious to me.  I seem
> >to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
> >on all platforms.  If you want to disable an irq, it really seems like you
> >would want to flush posted writes so you know the hardware has seen it.
> enforced ordering is happened by flush write buffer and wmb is
> commonly used to flush write buffer.
> so that wmb guarantees ordering by flush write buffer. That's why
> it's called a memory barrier.

Ok, sure, but I/O ordering two different writes, and ensuring device
has seen a posted write, are related but different things, no?  So if
you are disabling an interrupt and want to be really sure interrupts
are off when function returns, I think you still want the read.

Anyway, it's your driver, just pointing out that the "IMPORTANT"
read might still be important to someone.
Peter Oh Jan. 28, 2015, 5:39 a.m. UTC | #4
On 01/27/2015 08:30 PM, Bob Copeland wrote:
> On Tue, Jan 27, 2015 at 03:53:00PM -0800, Peter Oh wrote:
>>>> -	/* IMPORTANT: this extra read transaction is required to
>>>> -	 * flush the posted write buffer. */
>>>> -	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
>>>> -				PCIE_INTR_ENABLE_ADDRESS);
>>>> +	/* invoke data sync barrier */
>>>> +	wmb();
>>>>   }
>>> I am no expert in arcane PCI matters, but that looks suspicious to me.  I seem
>>> to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
>>> on all platforms.  If you want to disable an irq, it really seems like you
>>> would want to flush posted writes so you know the hardware has seen it.
>> enforced ordering is happened by flush write buffer and wmb is
>> commonly used to flush write buffer.
>> so that wmb guarantees ordering by flush write buffer. That's why
>> it's called a memory barrier.
> Ok, sure, but I/O ordering two different writes, and ensuring device
> has seen a posted write, are related but different things, no?
yes, they are different and wmb guarantees both.
>    So if
> you are disabling an interrupt and want to be really sure interrupts
> are off when function returns, I think you still want the read.
the mechanism that  read is using to make sure write work done is the 
same mechanism that wmb is using.
> Anyway, it's your driver, just pointing out that the "IMPORTANT"
> read might still be important to someone.
I really appreciate your opinion.
>
Regards,
Peter
Johannes Berg Jan. 28, 2015, 7:37 a.m. UTC | #5
On Tue, 2015-01-27 at 21:39 -0800, Peter Oh wrote:

> > Ok, sure, but I/O ordering two different writes, and ensuring device
> > has seen a posted write, are related but different things, no?
> yes, they are different and wmb guarantees both.

No, wmb() doesn't. I'd be very surprised if it had any side effect on
the PCI bus even on ARM. Read Documentation/memory-barriers.txt.

And to understand (PCI) posted writes, wikipedia helps:
http://en.wikipedia.org/wiki/Posted_write

johannes
Peter Oh Jan. 30, 2015, 10:53 p.m. UTC | #6
Hi,

On 01/27/2015 11:37 PM, Johannes Berg wrote:
> On Tue, 2015-01-27 at 21:39 -0800, Peter Oh wrote:
>
>>> Ok, sure, but I/O ordering two different writes, and ensuring device
>>> has seen a posted write, are related but different things, no?
>> yes, they are different and wmb guarantees both.
> No, wmb() doesn't. I'd be very surprised if it had any side effect on
> the PCI bus even on ARM. Read Documentation/memory-barriers.txt.
>
> And to understand (PCI) posted writes, wikipedia helps:
> http://en.wikipedia.org/wiki/Posted_write
I admit that I/O ordering and posted write are looked different in 
theory and at glance since posted write could be related cache invalidate.
But wmb are still related to both.
As I addressed wmb uses dsb (in arm arch) and here is the description of 
arm architecture.

* DSB drains write buffer.
* DSB is architecturally defined to include all cache, TLB and branch 
prediction maintenance operations as well as explicit memory operations

These are the reasons why I mentioned wmb does both.

* captured from ARMv7 Architecture Manual
--- Notes ---
Historically, this operation was referred to as Drain Write Buffer or 
Data Write Barrier (DWB). From ARMv6, these
names and the use of DWB were deprecated in favor of the new Data 
Synchronization Barrier name and DSB
abbreviation. DSB better reflects the functionality provided from ARMv6, 
because DSB is architecturally defined
to include all cache, TLB and branch prediction maintenance operations 
as well as explicit memory operations

--- A DSB completes when: ---
? all explicit memory accesses that are observed by Pe before the DSB is 
executed, are of the required access
types, and are from observers in the same required shareability domain 
as Pe, are complete for the set of
observers in the required shareability domain.
? if the required accesses types of the DSB is reads and writes, all 
cache and branch predictor maintenance
operations issued by Pe before the DSB are complete for the required 
shareability domain.
? if the required accesses types of the DSB is reads and writes, all TLB 
maintenance operations issued by Pe
before the DSB are complete for the required shareability domain.
--------------

Furthermore this is the comparison of the compiled assembly code between 
ath10k_pci_read32 and wmb.

ath10k_pci_read32()
      bac:    e5932008     ldr    r2, [r3, #8]
      bb0:    f57ff04f     dsb    sy
      bb4:    e2883d52     add    r3, r8, #5248    ; 0x1480
      bb8:    e283303c     add    r3, r3, #60    ; 0x3c
      bbc:    e593300c     ldr    r3, [r3, #12]
      bc0:    e2833a09     add    r3, r3, #36864    ; 0x9000

wmb();
      b9c:    f57ff04e     dsb    st

ath10k_pci_read32 does register operation except dsb and there is no 
cache invalidate related commands.

So that if wmb is not enough for the purpose then ath10k_pci_read32 is 
also not enough for that.

Also refer the section "ACQUIRES VS I/O ACCESSES" in memory-barriers.txt.

It gives an example with PCI bridge and introduces readl as an 
alternative method to mmiowb which weaker form of wmb.

Please give your opinion.
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter
Sujith Manoharan Jan. 31, 2015, 1:16 a.m. UTC | #7
Peter Oh wrote:
> As I addressed wmb uses dsb (in arm arch) and here is the description of 
> arm architecture.

What about other architectures ?

Sujith
Peter Oh Jan. 31, 2015, 1:56 a.m. UTC | #8
On 01/30/2015 05:16 PM, Sujith Manoharan wrote:
> Peter Oh wrote:
>> As I addressed wmb uses dsb (in arm arch) and here is the description of
>> arm architecture.
> What about other architectures ?
Please refer the email thread that I mentioned about other architectures.
(dsb is for ARM and other platforms have the equivalent instruction such 
as sfence, sync, mf, and dcs).
Also the patch is updated with 2nd patch set replacing wmb to mb.
> Sujith
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Sujith Manoharan Jan. 31, 2015, 2:06 a.m. UTC | #9
Peter Oh wrote:
> Please refer the email thread that I mentioned about other architectures.
> (dsb is for ARM and other platforms have the equivalent instruction such 
> as sfence, sync, mf, and dcs).

Ok.

> Also the patch is updated with 2nd patch set replacing wmb to mb.

Would be good to test this on a MIPS platform...

Sujith
Johannes Berg Feb. 2, 2015, 1:02 p.m. UTC | #10
On Fri, 2015-01-30 at 14:53 -0800, Peter Oh wrote:

> I admit that I/O ordering and posted write are looked different in 
> theory and at glance since posted write could be related cache invalidate.
> But wmb are still related to both.
> As I addressed wmb uses dsb (in arm arch) and here is the description of 
> arm architecture.
> 
> * DSB drains write buffer.
> * DSB is architecturally defined to include all cache, TLB and branch 
> prediction maintenance operations as well as explicit memory operations
> 
> These are the reasons why I mentioned wmb does both.
> 
> * captured from ARMv7 Architecture Manual
> --- Notes ---
> Historically, this operation was referred to as Drain Write Buffer or 
> Data Write Barrier (DWB). From ARMv6, these
> names and the use of DWB were deprecated in favor of the new Data 
> Synchronization Barrier name and DSB
> abbreviation. DSB better reflects the functionality provided from ARMv6, 
> because DSB is architecturally defined
> to include all cache, TLB and branch prediction maintenance operations 
> as well as explicit memory operations
> 
> --- A DSB completes when: ---
> ? all explicit memory accesses that are observed by Pe before the DSB is 
> executed, are of the required access
> types, and are from observers in the same required shareability domain 
> as Pe, are complete for the set of
> observers in the required shareability domain.
> ? if the required accesses types of the DSB is reads and writes, all 
> cache and branch predictor maintenance
> operations issued by Pe before the DSB are complete for the required 
> shareability domain.
> ? if the required accesses types of the DSB is reads and writes, all TLB 
> maintenance operations issued by Pe
> before the DSB are complete for the required shareability domain.
> --------------

I cannot read from this in any way that it can post writes to the PCIe
bus. In fact, architecturally, I cannot think of any reason how it even
could do that from the CPU.

> Furthermore this is the comparison of the compiled assembly code between 
> ath10k_pci_read32 and wmb.
> 
> ath10k_pci_read32()
>       bac:    e5932008     ldr    r2, [r3, #8]
>       bb0:    f57ff04f     dsb    sy
>       bb4:    e2883d52     add    r3, r8, #5248    ; 0x1480
>       bb8:    e283303c     add    r3, r3, #60    ; 0x3c
>       bbc:    e593300c     ldr    r3, [r3, #12]
>       bc0:    e2833a09     add    r3, r3, #36864    ; 0x9000
> 
> wmb();
>       b9c:    f57ff04e     dsb    st
> 
> ath10k_pci_read32 does register operation except dsb and there is no 
> cache invalidate related commands.

I don't think this is relevant. The question is "what are you trying to
achieve".

> So that if wmb is not enough for the purpose then ath10k_pci_read32 is 
> also not enough for that.
> 
> Also refer the section "ACQUIRES VS I/O ACCESSES" in memory-barriers.txt.
> 
> It gives an example with PCI bridge and introduces readl as an 
> alternative method to mmiowb which weaker form of wmb.
> 
> Please give your opinion.

Again - the question is - what are you trying to achieve?

The code (as it is before your patch) implies that it's trying to make
sure that before it continues, any previous writes to the PCIe device's
registers are posted. The only way to ensure that is to do a read to the
registers, as the code does now.

What you're describing is something else entirely - you're describing a
way to make sure that some data was flushed out to DRAM from the CPU
caches.

These two things are not related in any way.

In an interrupt routine, it would make sense to ensure that the write
was posted (e.g. to mask interrupts, or to acknowledge them, or similar,
before the routine can be re-invoked.)

To me, flushing memory writes to DRAM makes less sense in an interrupt
handlers unless the device was somehow using DMA to coordinate
interrupts [1], which seems unlikely but I haven't checked.

Anyway - I have no particular interest in this discussion, I was merely
trying to help you out with this :) You can make whatever change you
want, of course :P

johannes

[1] incidentally, our device [iwlwifi] does in fact do something like
that, but it's read-only for the driver so no need for such a thing
either
Peter Oh Feb. 2, 2015, 5:25 p.m. UTC | #11
On 01/30/2015 06:06 PM, Sujith Manoharan wrote:
> Peter Oh wrote:
>> Please refer the email thread that I mentioned about other
> architectures.
>> (dsb is for ARM and other platforms have the equivalent instruction such
>> as sfence, sync, mf, and dcs).
> Ok.
>
>> Also the patch is updated with 2nd patch set replacing wmb to mb.
> Would be good to test this on a MIPS platform...
Agree. I'll do the investigation and test.
> Sujith
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Thanks,
Peter
Peter Oh Feb. 2, 2015, 5:33 p.m. UTC | #12
On 02/02/2015 05:02 AM, Johannes Berg wrote:
> On Fri, 2015-01-30 at 14:53 -0800, Peter Oh wrote:
>
>> I admit that I/O ordering and posted write are looked different in
>> theory and at glance since posted write could be related cache
> invalidate.
>> But wmb are still related to both.
>> As I addressed wmb uses dsb (in arm arch) and here is the description of
>> arm architecture.
>>
>> * DSB drains write buffer.
>> * DSB is architecturally defined to include all cache, TLB and branch
>> prediction maintenance operations as well as explicit memory operations
>>
>> These are the reasons why I mentioned wmb does both.
>>
>> * captured from ARMv7 Architecture Manual
>> --- Notes ---
>> Historically, this operation was referred to as Drain Write Buffer or
>> Data Write Barrier (DWB). From ARMv6, these
>> names and the use of DWB were deprecated in favor of the new Data
>> Synchronization Barrier name and DSB
>> abbreviation. DSB better reflects the functionality provided from ARMv6,
>> because DSB is architecturally defined
>> to include all cache, TLB and branch prediction maintenance operations
>> as well as explicit memory operations
>>
>> --- A DSB completes when: ---
>> ? all explicit memory accesses that are observed by Pe before the DSB is
>> executed, are of the required access
>> types, and are from observers in the same required shareability domain
>> as Pe, are complete for the set of
>> observers in the required shareability domain.
>> ? if the required accesses types of the DSB is reads and writes, all
>> cache and branch predictor maintenance
>> operations issued by Pe before the DSB are complete for the required
>> shareability domain.
>> ? if the required accesses types of the DSB is reads and writes, all TLB
>> maintenance operations issued by Pe
>> before the DSB are complete for the required shareability domain.
>> --------------
> I cannot read from this in any way that it can post writes to the PCIe
> bus. In fact, architecturally, I cannot think of any reason how it even
> could do that from the CPU.
>
>> Furthermore this is the comparison of the compiled assembly code between
>> ath10k_pci_read32 and wmb.
>>
>> ath10k_pci_read32()
>>        bac:    e5932008     ldr    r2, [r3, #8]
>>        bb0:    f57ff04f     dsb    sy
>>        bb4:    e2883d52     add    r3, r8, #5248    ; 0x1480
>>        bb8:    e283303c     add    r3, r3, #60    ; 0x3c
>>        bbc:    e593300c     ldr    r3, [r3, #12]
>>        bc0:    e2833a09     add    r3, r3, #36864    ; 0x9000
>>
>> wmb();
>>        b9c:    f57ff04e     dsb    st
>>
>> ath10k_pci_read32 does register operation except dsb and there is no
>> cache invalidate related commands.
> I don't think this is relevant. The question is "what are you trying to
> achieve".
>
>> So that if wmb is not enough for the purpose then ath10k_pci_read32 is
>> also not enough for that.
>>
>> Also refer the section "ACQUIRES VS I/O ACCESSES" in
> memory-barriers.txt.
>> It gives an example with PCI bridge and introduces readl as an
>> alternative method to mmiowb which weaker form of wmb.
>>
>> Please give your opinion.
> Again - the question is - what are you trying to achieve?
>
> The code (as it is before your patch) implies that it's trying to make
> sure that before it continues, any previous writes to the PCIe device's
> registers are posted. The only way to ensure that is to do a read to the
> registers, as the code does now.
Do you know how the read ensure that although the read code does not 
check the return value?
Can you explain how a read ensures that posted write reaches PCIe device?
> What you're describing is something else entirely - you're describing a
> way to make sure that some data was flushed out to DRAM from the CPU
> caches.
>
> These two things are not related in any way.
>
> In an interrupt routine, it would make sense to ensure that the write
> was posted (e.g. to mask interrupts, or to acknowledge them, or similar,
> before the routine can be re-invoked.)
>
> To me, flushing memory writes to DRAM makes less sense in an interrupt
> handlers unless the device was somehow using DMA to coordinate
> interrupts [1], which seems unlikely but I haven't checked.
>
> Anyway - I have no particular interest in this discussion, I was merely
> trying to help you out with this :) You can make whatever change you
> want, of course :P
>
> johannes
>
> [1] incidentally, our device [iwlwifi] does in fact do something like
> that, but it's read-only for the driver so no need for such a thing
> either
>
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Thanks,
Peter
Johannes Berg Feb. 2, 2015, 6:54 p.m. UTC | #13
On Mon, 2015-02-02 at 09:33 -0800, Peter Oh wrote:

> > The code (as it is before your patch) implies that it's trying to make
> > sure that before it continues, any previous writes to the PCIe device's
> > registers are posted. The only way to ensure that is to do a read to the
> > registers, as the code does now.

> Do you know how the read ensure that although the read code does not 
> check the return value?
> Can you explain how a read ensures that posted write reaches PCIe device?

You basically have the following sequence:

iowrite()
ioread()

If you look, you'll see that iowrite() is actually done (or should be,
or perhaps with appropriate syncs) on an uncached mapping. As a result,
the only thing you care about here is the PCIe bus, not the CPU cache
flush. And from there on that's just a question of PCIe bus semantics.

johannes
Peter Oh Feb. 2, 2015, 7:15 p.m. UTC | #14
On 02/02/2015 10:54 AM, Johannes Berg wrote:
> On Mon, 2015-02-02 at 09:33 -0800, Peter Oh wrote:
>
>>> The code (as it is before your patch) implies that it's trying to make
>>> sure that before it continues, any previous writes to the PCIe
> device's
>>> registers are posted. The only way to ensure that is to do a read to
> the
>>> registers, as the code does now.
>> Do you know how the read ensure that although the read code does not
>> check the return value?
>> Can you explain how a read ensures that posted write reaches PCIe
> device?
>
> You basically have the following sequence:
>
> iowrite()
> ioread()
>
> If you look, you'll see that iowrite() is actually done (or should be,
> or perhaps with appropriate syncs) on an uncached mapping.
since it's mmio, iowrite will be map to write, not out which is cached 
mapping.
That's why we address "posted write" here.
If it's un-cached mapping which is volatile, we don't even need ioread.
>   As a result,
> the only thing you care about here is the PCIe bus, not the CPU cache
> flush. And from there on that's just a question of PCIe bus semantics.
So how does ioread guarantee PCIe bus transaction done?
>
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter
Johannes Berg Feb. 2, 2015, 7:22 p.m. UTC | #15
> > You basically have the following sequence:
> >
> > iowrite()
> > ioread()
> >
> > If you look, you'll see that iowrite() is actually done (or should be,
> > or perhaps with appropriate syncs) on an uncached mapping.
> since it's mmio, iowrite will be map to write, not out which is cached 
> mapping.
> That's why we address "posted write" here.
> If it's un-cached mapping which is volatile, we don't even need ioread.

No, this isn't true - "posted write" in the context of this discussion
is about the PCIe bus. Memory writes that go through cache aren't
referred to as "posted writes", those are just (cached) memory writes.

> >   As a result,
> > the only thing you care about here is the PCIe bus, not the CPU cache
> > flush. And from there on that's just a question of PCIe bus semantics.
> So how does ioread guarantee PCIe bus transaction done?

That's how PCIe works, operations are serialized, and read() has to wait
for a response from the device (but write doesn't - which is "posted
write")

johannes
Peter Oh Feb. 2, 2015, 7:36 p.m. UTC | #16
On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>> You basically have the following sequence:
>>>
>>> iowrite()
>>> ioread()
>>>
>>> If you look, you'll see that iowrite() is actually done (or should be,
>>> or perhaps with appropriate syncs) on an uncached mapping.
>> since it's mmio, iowrite will be map to write, not out which is cached
>> mapping.
>> That's why we address "posted write" here.
>> If it's un-cached mapping which is volatile, we don't even need ioread.
> No, this isn't true - "posted write" in the context of this discussion
> is about the PCIe bus. Memory writes that go through cache aren't
> referred to as "posted writes", those are just (cached) memory writes.
>
>>>    As a result,
>>> the only thing you care about here is the PCIe bus, not the CPU cache
>>> flush. And from there on that's just a question of PCIe bus semantics.
>> So how does ioread guarantee PCIe bus transaction done?
> That's how PCIe works, operations are serialized, and read() has to wait
> for a response from the device
do you know which mechanism or which instruction set makes read() wait 
for a response from the device?
>   (but write doesn't - which is "posted
> write")
>
> johannes
>
Johannes Berg Feb. 2, 2015, 7:47 p.m. UTC | #17
On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
> On 02/02/2015 11:22 AM, Johannes Berg wrote:
> >>> You basically have the following sequence:
> >>>
> >>> iowrite()
> >>> ioread()
> >>>
> >>> If you look, you'll see that iowrite() is actually done (or should be,
> >>> or perhaps with appropriate syncs) on an uncached mapping.
> >> since it's mmio, iowrite will be map to write, not out which is cached
> >> mapping.
> >> That's why we address "posted write" here.
> >> If it's un-cached mapping which is volatile, we don't even need ioread.
> > No, this isn't true - "posted write" in the context of this discussion
> > is about the PCIe bus. Memory writes that go through cache aren't
> > referred to as "posted writes", those are just (cached) memory writes.
> >
> >>>    As a result,
> >>> the only thing you care about here is the PCIe bus, not the CPU cache
> >>> flush. And from there on that's just a question of PCIe bus semantics.
> >> So how does ioread guarantee PCIe bus transaction done?
> > That's how PCIe works, operations are serialized, and read() has to wait
> > for a response from the device
> do you know which mechanism or which instruction set makes read() wait 
> for a response from the device?

I have no idea. I assume it's just like a DRAM read, the CPU stalls
while there's no response.

johannes
Peter Oh Feb. 2, 2015, 10:06 p.m. UTC | #18
On 02/02/2015 11:47 AM, Johannes Berg wrote:
> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>> You basically have the following sequence:
>>>>>
>>>>> iowrite()
>>>>> ioread()
>>>>>
>>>>> If you look, you'll see that iowrite() is actually done (or should
> be,
>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>> since it's mmio, iowrite will be map to write, not out which is
> cached
>>>> mapping.
>>>> That's why we address "posted write" here.
>>>> If it's un-cached mapping which is volatile, we don't even need
> ioread.
>>> No, this isn't true - "posted write" in the context of this discussion
>>> is about the PCIe bus. Memory writes that go through cache aren't
>>> referred to as "posted writes", those are just (cached) memory writes.
>>>
>>>>>     As a result,
>>>>> the only thing you care about here is the PCIe bus, not the CPU
> cache
>>>>> flush. And from there on that's just a question of PCIe bus
> semantics.
>>>> So how does ioread guarantee PCIe bus transaction done?
>>> That's how PCIe works, operations are serialized, and read() has to
> wait
>>> for a response from the device
>> do you know which mechanism or which instruction set makes read() wait
>> for a response from the device?
> I have no idea. I assume it's just like a DRAM read, the CPU stalls
> while there's no response.
My explanation in this thread is all about how read() guarantees the 
wait for a response from the device, therefore why mb() - replace from 
wmb at patch set 2 - is compatible to read().
Briefly speaking,
read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus 
-> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe 
device) signals write completion when write transactions completed in 
write response channel ->  cpu release axi bus -> cpu program counter 
(pc) proceeds the next to read.

the exact same routines happen with mb().
mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus -> 
cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe 
device) signals write completion when write transactions completed in 
write response channel ->  cpu release axi bus -> cpu program counter 
(pc) proceeds the next to read.

Since axi bus master is waiting (blocking) for write completion signal 
from axi slave (PCIe device), this is how read() and mb() guarantee 
write command reaches to the device.
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter
Peter Oh Feb. 2, 2015, 10:15 p.m. UTC | #19
On 02/02/2015 02:06 PM, Peter Oh wrote:
>
> On 02/02/2015 11:47 AM, Johannes Berg wrote:
>> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>>> You basically have the following sequence:
>>>>>>
>>>>>> iowrite()
>>>>>> ioread()
>>>>>>
>>>>>> If you look, you'll see that iowrite() is actually done (or should
>> be,
>>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>>> since it's mmio, iowrite will be map to write, not out which is
>> cached
>>>>> mapping.
>>>>> That's why we address "posted write" here.
>>>>> If it's un-cached mapping which is volatile, we don't even need
>> ioread.
>>>> No, this isn't true - "posted write" in the context of this discussion
>>>> is about the PCIe bus. Memory writes that go through cache aren't
>>>> referred to as "posted writes", those are just (cached) memory writes.
>>>>
>>>>>>     As a result,
>>>>>> the only thing you care about here is the PCIe bus, not the CPU
>> cache
>>>>>> flush. And from there on that's just a question of PCIe bus
>> semantics.
>>>>> So how does ioread guarantee PCIe bus transaction done?
>>>> That's how PCIe works, operations are serialized, and read() has to
>> wait
>>>> for a response from the device
>>> do you know which mechanism or which instruction set makes read() wait
>>> for a response from the device?
>> I have no idea. I assume it's just like a DRAM read, the CPU stalls
>> while there's no response.
> My explanation in this thread is all about how read() guarantees the 
> wait for a response from the device, therefore why mb() - replace from 
> wmb at patch set 2 - is compatible to read().
> Briefly speaking,
> read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus 
> -> cpu post write buffer on axi bus -> axi bus (axi slave which is 
> PCIe device) signals write completion when write transactions 
> completed in write response channel ->  cpu release axi bus -> cpu 
> program counter (pc) proceeds the next to read.
>
> the exact same routines happen with mb().
> mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus 
> -> cpu post write buffer on axi bus -> axi bus (axi slave which is 
> PCIe device) signals write completion when write transactions 
> completed in write response channel ->  cpu release axi bus -> cpu 
> program counter (pc) proceeds the next to read.
read -> mb (erratum)
>
> Since axi bus master is waiting (blocking) for write completion signal 
> from axi slave (PCIe device), this is how read() and mb() guarantee 
> write command reaches to the device.
>> johannes
>>
>>
>> _______________________________________________
>> ath10k mailing list
>> ath10k@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/ath10k
> Regards,
> Peter
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
Adrian Chadd Feb. 2, 2015, 10:26 p.m. UTC | #20
On 30 January 2015 at 18:06, Sujith Manoharan <sujith@msujith.org> wrote:
> Peter Oh wrote:
>> Please refer the email thread that I mentioned about other architectures.
>> (dsb is for ARM and other platforms have the equivalent instruction such
>> as sfence, sync, mf, and dcs).
>
> Ok.
>
>> Also the patch is updated with 2nd patch set replacing wmb to mb.
>
> Would be good to test this on a MIPS platform...
>

The Atheros mips74k stuff I have here does /not/ flush all the writes
out to the device and guarantee the device has seen everything with a
memory barrier. Just saying. Various drivers ended up needing
ioread()s in my experiments; mips sync operations weren't enough.

So I'd suggest abstracting it out like the linux dri i915 code has -
they define a "posting read" macro which they use whenever they need
to ensure it's definitely made it all the way out to the hardware and
through internal FIFOs so internal hardware has seen the state change.
Then you can redefine that to your hearts content based on platform.



-adrian
Peter Oh Feb. 2, 2015, 11:04 p.m. UTC | #21
On 02/02/2015 02:26 PM, Adrian Chadd wrote:
> On 30 January 2015 at 18:06, Sujith Manoharan <sujith@msujith.org> wrote:
>> Peter Oh wrote:
>>> Please refer the email thread that I mentioned about other architectures.
>>> (dsb is for ARM and other platforms have the equivalent instruction such
>>> as sfence, sync, mf, and dcs).
>> Ok.
>>
>>> Also the patch is updated with 2nd patch set replacing wmb to mb.
>> Would be good to test this on a MIPS platform...
>>
> The Atheros mips74k stuff I have here does /not/ flush all the writes
> out to the device and guarantee the device has seen everything with a
> memory barrier. Just saying. Various drivers ended up needing
> ioread()s in my experiments; mips sync operations weren't enough.
>
> So I'd suggest abstracting it out like the linux dri i915 code has -
> they define a "posting read" macro which they use whenever they need
> to ensure it's definitely made it all the way out to the hardware and
> through internal FIFOs so internal hardware has seen the state change.
> Then you can redefine that to your hearts content based on platform.
Thank you Adrian to head up the concern and suggestion.
The other people also concerned about other architectures like mips, so 
I was going to analysis it.
But since you've already experienced the defects, let me hold the change 
back until I find better for all.
>
>
> -adrian
Regards,
Peter
Florian Fainelli Feb. 2, 2015, 11:25 p.m. UTC | #22
On 02/02/15 14:06, Peter Oh wrote:
> 
> On 02/02/2015 11:47 AM, Johannes Berg wrote:
>> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>>> You basically have the following sequence:
>>>>>>
>>>>>> iowrite()
>>>>>> ioread()
>>>>>>
>>>>>> If you look, you'll see that iowrite() is actually done (or should
>> be,
>>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>>> since it's mmio, iowrite will be map to write, not out which is
>> cached
>>>>> mapping.
>>>>> That's why we address "posted write" here.
>>>>> If it's un-cached mapping which is volatile, we don't even need
>> ioread.
>>>> No, this isn't true - "posted write" in the context of this discussion
>>>> is about the PCIe bus. Memory writes that go through cache aren't
>>>> referred to as "posted writes", those are just (cached) memory writes.
>>>>
>>>>>>     As a result,
>>>>>> the only thing you care about here is the PCIe bus, not the CPU
>> cache
>>>>>> flush. And from there on that's just a question of PCIe bus
>> semantics.
>>>>> So how does ioread guarantee PCIe bus transaction done?
>>>> That's how PCIe works, operations are serialized, and read() has to
>> wait
>>>> for a response from the device
>>> do you know which mechanism or which instruction set makes read() wait
>>> for a response from the device?
>> I have no idea. I assume it's just like a DRAM read, the CPU stalls
>> while there's no response.
> My explanation in this thread is all about how read() guarantees the
> wait for a response from the device, therefore why mb() - replace from
> wmb at patch set 2 - is compatible to read().
> Briefly speaking,
> read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus
> -> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
> device) signals write completion when write transactions completed in
> write response channel ->  cpu release axi bus -> cpu program counter
> (pc) proceeds the next to read.
> 
> the exact same routines happen with mb().
> mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus ->
> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
> device) signals write completion when write transactions completed in
> write response channel ->  cpu release axi bus -> cpu program counter
> (pc) proceeds the next to read.
> 
> Since axi bus master is waiting (blocking) for write completion signal
> from axi slave (PCIe device), this is how read() and mb() guarantee
> write command reaches to the device.

PCIe writes are posted, so the only guarantee you can have by inserting
such barriers is that writes from CPU to the PCIe RC (targeting PCIe
device) is non-posted (as far as the busing between CPU and the PCIe RC
is concerned), but past the PCIe RC, there is no such guarantee, because
the PCIe specification allows for that and there is flow control, PCIe
switches or other things that can alter the way your PCIe device ends-up
being written to.

The only way to make a "portable" synchronization barrier is to do a
PCIe read from the same register you just wrote to, because then, the
PCIe RC needs to guarantee the transaction ordering on the PCIe bus itself.

You might just be lucky and/or have very good HW which ensures that the
ARM synchronization barriers are propagated to the memory region where
your PCIe device BARs are mapped from the CPU perspective, but you
definitively cannot rely on such assumptions, as there will be bogus HW
there, for which only a subsequent ioread32() will work.
Peter Oh Feb. 2, 2015, 11:49 p.m. UTC | #23
Hi Florian,

Very appreciate your explanation in detail.

Regards,
Peter
On 02/02/2015 03:25 PM, Florian Fainelli wrote:
> On 02/02/15 14:06, Peter Oh wrote:
>> On 02/02/2015 11:47 AM, Johannes Berg wrote:
>>> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>>>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>>>> You basically have the following sequence:
>>>>>>>
>>>>>>> iowrite()
>>>>>>> ioread()
>>>>>>>
>>>>>>> If you look, you'll see that iowrite() is actually done (or should
>>> be,
>>>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>>>> since it's mmio, iowrite will be map to write, not out which is
>>> cached
>>>>>> mapping.
>>>>>> That's why we address "posted write" here.
>>>>>> If it's un-cached mapping which is volatile, we don't even need
>>> ioread.
>>>>> No, this isn't true - "posted write" in the context of this discussion
>>>>> is about the PCIe bus. Memory writes that go through cache aren't
>>>>> referred to as "posted writes", those are just (cached) memory writes.
>>>>>
>>>>>>>      As a result,
>>>>>>> the only thing you care about here is the PCIe bus, not the CPU
>>> cache
>>>>>>> flush. And from there on that's just a question of PCIe bus
>>> semantics.
>>>>>> So how does ioread guarantee PCIe bus transaction done?
>>>>> That's how PCIe works, operations are serialized, and read() has to
>>> wait
>>>>> for a response from the device
>>>> do you know which mechanism or which instruction set makes read() wait
>>>> for a response from the device?
>>> I have no idea. I assume it's just like a DRAM read, the CPU stalls
>>> while there's no response.
>> My explanation in this thread is all about how read() guarantees the
>> wait for a response from the device, therefore why mb() - replace from
>> wmb at patch set 2 - is compatible to read().
>> Briefly speaking,
>> read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus
>> -> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
>> device) signals write completion when write transactions completed in
>> write response channel ->  cpu release axi bus -> cpu program counter
>> (pc) proceeds the next to read.
>>
>> the exact same routines happen with mb().
>> mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus ->
>> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
>> device) signals write completion when write transactions completed in
>> write response channel ->  cpu release axi bus -> cpu program counter
>> (pc) proceeds the next to read.
>>
>> Since axi bus master is waiting (blocking) for write completion signal
>> from axi slave (PCIe device), this is how read() and mb() guarantee
>> write command reaches to the device.
> PCIe writes are posted, so the only guarantee you can have by inserting
> such barriers is that writes from CPU to the PCIe RC (targeting PCIe
> device) is non-posted (as far as the busing between CPU and the PCIe RC
> is concerned), but past the PCIe RC, there is no such guarantee, because
> the PCIe specification allows for that and there is flow control, PCIe
> switches or other things that can alter the way your PCIe device ends-up
> being written to.
>
> The only way to make a "portable" synchronization barrier is to do a
> PCIe read from the same register you just wrote to, because then, the
> PCIe RC needs to guarantee the transaction ordering on the PCIe bus itself.
>
> You might just be lucky and/or have very good HW which ensures that the
> ARM synchronization barriers are propagated to the memory region where
> your PCIe device BARs are mapped from the CPU perspective, but you
> definitively cannot rely on such assumptions, as there will be bogus HW
> there, for which only a subsequent ioread32() will work.
diff mbox

Patch

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 3b40a86..c353a2c 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -346,10 +346,8 @@  static void ath10k_pci_disable_and_clear_legacy_irq(struct ath10k *ar)
 	ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
 			   PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
 
-	/* IMPORTANT: this extra read transaction is required to
-	 * flush the posted write buffer. */
-	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
-				PCIE_INTR_ENABLE_ADDRESS);
+	/* invoke data sync barrier */
+	wmb();
 }
 
 static void ath10k_pci_enable_legacy_irq(struct ath10k *ar)
@@ -358,10 +356,8 @@  static void ath10k_pci_enable_legacy_irq(struct ath10k *ar)
 			   PCIE_INTR_ENABLE_ADDRESS,
 			   PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
 
-	/* IMPORTANT: this extra read transaction is required to
-	 * flush the posted write buffer. */
-	(void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
-				PCIE_INTR_ENABLE_ADDRESS);
+	/* invoke data sync barrier */
+	wmb();
 }
 
 static inline const char *ath10k_pci_get_irq_method(struct ath10k *ar)