Message ID | 20180224223018.9731-1-jsmart2021@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
On Sat, Feb 24, 2018 at 11:30 PM, James Smart <jsmart2021@gmail.com> wrote: > writeq() is not present on all 32-bit architectures. > > When 32-bit, use writel() > > Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> > Signed-off-by: James Smart <james.smart@broadcom.com> > --- > drivers/scsi/lpfc/lpfc_sli.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c > index 4ce3ca6f4b79..f2bf3bf93aa6 100644 > --- a/drivers/scsi/lpfc/lpfc_sli.c > +++ b/drivers/scsi/lpfc/lpfc_sli.c > @@ -141,8 +141,13 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe) > if (q->dpp_enable && q->phba->cfg_enable_dpp) { > /* write to DPP aperture taking advatage of Combined Writes */ > tmp = (uint8_t *)wqe; > +#ifdef CONFIG_64BIT > for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) > writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i); > +#else > + for (i = 0; i < q->entry_size; i += sizeof(uint32_t)) > + writel(*((uint32_t *)(tmp + i)), q->dpp_regaddr + i); > +#endif > } Unfortunately, this is still broken on all big-endian architectures. You could use __raw_writeq() here to fix it, or change the if() clause at the beginning to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that. Arnd
On 2/26/2018 12:36 AM, Arnd Bergmann wrote: > Unfortunately, this is still broken on all big-endian architectures. You could > use __raw_writeq() here to fix it, or change the if() clause at the beginning > to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that. > > Arnd Please explain. There doesn't seem to be a definitive guide on which of the several routines for bar mapping and the several routines for memory copying should be used. We've gone with what we tested and measured. Also, this code is disabled on everything but X86 right now - so the patch should be fine. We settled on the combination which our testing showed the best results. We tried other architectures, but there wasn't a combination that showed great results. If we figure a combination out, we will update the code. Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall performance as it seems to make things cacheable and the speculation is the caching aspects delayed the results just enough to make the overall gain go down. -- james
On Mon, Feb 26, 2018 at 6:01 PM, James Smart <jsmart2021@gmail.com> wrote: > > > On 2/26/2018 12:36 AM, Arnd Bergmann wrote: >> >> Unfortunately, this is still broken on all big-endian architectures. You >> could >> use __raw_writeq() here to fix it, or change the if() clause at the >> beginning >> to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that. >> >> Arnd > > > Please explain. There doesn't seem to be a definitive guide on which of the > several routines for bar mapping and the several routines for memory copying > should be used. We've gone with what we tested and measured. > > Also, this code is disabled on everything but X86 right now - so the patch > should be fine. We settled on the combination which our testing showed the > best results. We tried other architectures, but there wasn't a combination > that showed great results. If we figure a combination out, we will update > the code. For the endianess, the key to understanding this is that readl/writel and readq/writeq follow the convention of accessing data as little-endian because that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit register value that gets copied between a register and the CPU, and it's up to the device manufacturer to implement that convention in hardware. The Linux implementation of writeq() on all architectures therefore puts the LSB into the first byte of the output. In contrast, accessing memory using a pointer dereference is defined by the CPU manufacturer, so in a big-endian CPU the first byte corresponds to the MSB of the CPU register. In your loop for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i); that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr, which makes no sense as the data in tmp is (presumably) just a stream of bytes that we want to end up on a disk in the same order, regardless of which mode the CPU happens to be in at the time we copy the data. > Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall > performance as it seems to make things cacheable and the speculation is the > caching aspects delayed the results just enough to make the overall gain go > down. ioremap_wc should never make the access cacheable, but architectures can fall back to a regular uncached mapping without write-combining when there is no hardware support for write-combining. However, when you use writeq() on powerpc, you get a very expensive barriers ("sync") before each 8 byte chunk, which purposely defeats the write-combining, plus you might get an indirect function call, depending on the kernel configuration and the hardware you are running on. Using __raw_writeq() should get both the endianess right and avoid all those barriers you really don't want for what is essentially a memcpy(). Arnd
On 2/26/2018 12:04 PM, Arnd Bergmann wrote: > For the endianess, the key to understanding this is that readl/writel and > readq/writeq follow the convention of accessing data as little-endian because > that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit > register value that gets copied between a register and the CPU, and it's > up to the device manufacturer to implement that convention in hardware. > The Linux implementation of writeq() on all architectures therefore puts the > LSB into the first byte of the output. ok, so I interpret this as readx/writex() get a cpu-endian-based value, and the routine compensates for cpu endianness as it puts it out on the io bus (LSB=1st byte). All well and good. > In contrast, accessing memory using a pointer dereference is defined by > the CPU manufacturer, so in a big-endian CPU the first byte corresponds > to the MSB of the CPU register. > > In your loop > > for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) > writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i); > > that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr, > which makes no sense as the data in tmp is (presumably) just a stream of > bytes that we want to end up on a disk in the same order, regardless > of which mode the CPU happens to be in at the time we copy the data. This is where I get a bit lost. So what you're telling me is: with uint64_t a=5, *b; *b=5; that writeq(a, regptr); is not the same as writeq(*b, regptr); ?? That doesn't make sense to me as "*ptr" should have resulted in a cpu-endian value being given to writeq(), ignorant of how that value came into existence (pointer/constant/value in register). Then writeq() does it's thing to ensure LSB=1st byte It would only make sense if writeq() can be a macro that such that it can resolve into something that is essentially *regport = *b; thus it's treated as a bytestream starting at address b and does not necessarily compensate for the cpu-endianness of the multi-byte quantity. -- james
On Mon, Feb 26, 2018 at 10:41 PM, James Smart <jsmart2021@gmail.com> wrote: > On 2/26/2018 12:04 PM, Arnd Bergmann wrote: >> >> For the endianess, the key to understanding this is that readl/writel and >> readq/writeq follow the convention of accessing data as little-endian >> because >> that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit >> register value that gets copied between a register and the CPU, and it's >> up to the device manufacturer to implement that convention in hardware. >> The Linux implementation of writeq() on all architectures therefore puts >> the >> LSB into the first byte of the output. > > > ok, so I interpret this as readx/writex() get a cpu-endian-based value, and > the routine compensates for cpu endianness as it puts it out on the io bus > (LSB=1st byte). All well and good. it helps to think of it as compensating for device endianess rather than compensating for CPU endianess. The effect is the same of course. >> In contrast, accessing memory using a pointer dereference is defined by >> the CPU manufacturer, so in a big-endian CPU the first byte corresponds >> to the MSB of the CPU register. >> >> In your loop >> >> for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) >> writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + >> i); >> >> that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr, >> which makes no sense as the data in tmp is (presumably) just a stream of >> bytes that we want to end up on a disk in the same order, regardless >> of which mode the CPU happens to be in at the time we copy the data. > > > This is where I get a bit lost. > > So what you're telling me is: > with > uint64_t a=5, *b; > *b=5; > that > writeq(a, regptr); > is not the same as > writeq(*b, regptr); > ?? > > That doesn't make sense to me as "*ptr" should have resulted in a cpu-endian > value being given to writeq(), ignorant of how that value came into > existence (pointer/constant/value in register). Then writeq() does it's > thing to ensure LSB=1st byte > > It would only make sense if writeq() can be a macro that such that it can > resolve into something that is essentially > *regport = *b; > thus it's treated as a bytestream starting at address b and does not > necessarily compensate for the cpu-endianness of the multi-byte quantity. What you are describing above is not a byte stream but dealing with a 64-bit integer. In both cases you obviously end up with the destination data being 0x05 0x00 0x00 0x00 0x00 0x00 0x00 0x00, there is no difference. The case that you have in the driver is more like char *bytestream = "this is a byte stream"; u64 *tmp = (u64 *)bytestream; writeq(*tmp, regptr); writeq(*(tmp + 1), regptr + 1); On a big-endian cpu, *tmp is the 64-bit number 0x7468697320697320, and *(tmp +1) is 0x6120627974652073, as the access to the memory location is performed using big-endian semantics. When we write that to the PCI bus using little-endian semantics, the buffer ends up being 0x20 0x73 0x69 0x20 0x73 0x69 0x68 0x74 0x20 0x73 0x20 0x65 0x74 0x79 0x62 0x20, or " si siht s etyb a". Arnd
On 2/27/2018 12:58 AM, Arnd Bergmann wrote: > What you are describing above is not a byte stream but dealing with > a 64-bit integer. In both cases you obviously end up with the destination > data being 0x05 0x00 0x00 0x00 0x00 0x00 0x00 0x00, there is no > difference. > > The case that you have in the driver is more like > > char *bytestream = "this is a byte stream"; > u64 *tmp = (u64 *)bytestream; > writeq(*tmp, regptr); > writeq(*(tmp + 1), regptr + 1); > > On a big-endian cpu, *tmp is the 64-bit number 0x7468697320697320, > and *(tmp +1) is 0x6120627974652073, as the access to the memory > location is performed using big-endian semantics. > > When we write that to the PCI bus using little-endian semantics, the > buffer ends up being 0x20 0x73 0x69 0x20 0x73 0x69 0x68 0x74 > 0x20 0x73 0x20 0x65 0x74 0x79 0x62 0x20, or " si siht s etyb a". ok - I get what you are saying... The summary is: the writeX() routines take a cpu-endianness specific int value and send it to the device in a byte order corresponding to LSB to MSB. On LE cpu's using writex() to PCI (LE bus), LSB to MSB mirrors byte order in memory. But on BE cpu's to PCI, LSB to MSB means the byte order was effectively byteswapped as it is sent to the device. So you point out a very real concern, as in most cases the source buffer is a bytestream and the desire is to send the bytestream in the same byte order as in memory. It turns out we're somewhat lucky as the driver's source buffer wasn't a real bytestream, it was a cpu-endian relative structure so writex behavior would be ok. However, the driver is confused as it wants to use 64-bit copies, but the cpu-endian structure was based on 32-bit's at a time when mapping to the hardware. So things are ok if using writel, but not so ok if writeq. As the code stands write now - this feature and the routine is only used when x86, thus LE, thus the whole LSB conversion by writeX works. It's a mute point. Please accept the patch as proposed. We will address these other issues as we include the "push" support for other architectures. -- james
On Tue, Feb 27, 2018 at 7:05 PM, James Smart <james.smart@broadcom.com> wrote: > On 2/27/2018 12:58 AM, Arnd Bergmann wrote: > > So you point out a very real concern, as in most cases the source buffer is > a bytestream and the desire is to send the bytestream in the same byte order > as in memory. It turns out we're somewhat lucky as the driver's source > buffer wasn't a real bytestream, it was a cpu-endian relative structure so > writex behavior would be ok. However, the driver is confused as it wants to > use 64-bit copies, but the cpu-endian structure was based on 32-bit's at a > time when mapping to the hardware. So things are ok if using writel, but not > so ok if writeq. Ah, so it is a structure of only 32-bit integers? Obviously if you have 16-bit or 64-bit integers mixed in with the structure, things get even more complicated. For an array of 32-bit little-endian registers, this variant should work on all architectures: #include <linux/io-64-nonatomic-lo-hi.h> for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) writeq_relaxed(*((uint32_t *)(tmp + i) | *((uint32_t *)(tmp + i + 1) << 32, (q->dpp_regaddr + i) On x86_64 and x86_32, that should do the exact same thing as your version, but it should also work efficiently on other architectures. The __raw_writeq() version I suggested earlier is not correct if we are dealing with an array of __le32 values in hardware. On powerpc, the writeq_relaxed() unfortunately still prevents write-combining, so that won't be efficient. > As the code stands write now - this feature and the routine is only used > when x86, thus LE, thus the whole LSB conversion by writeX works. It's a > mute point. > > Please accept the patch as proposed. We will address these other issues as > we include the "push" support for other architectures. Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)' block then to make sure that if anybody tries to make it work on other architectures, they are aware of the problem? Arnd
On 2/27/2018 12:15 PM, Arnd Bergmann wrote: > Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)' > block then to make sure that if anybody tries to make it work on other > architectures, they are aware of the problem? the ifdef is around the area where wc is enabled. I'd prefer not to add another. However, I will add a comment about the issue, which can warn people. -- james
On Tue, Feb 27, 2018 at 9:24 PM, James Smart <jsmart2021@gmail.com> wrote: > On 2/27/2018 12:15 PM, Arnd Bergmann wrote: >> >> Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)' >> block then to make sure that if anybody tries to make it work on other >> architectures, they are aware of the problem? > > > the ifdef is around the area where wc is enabled. I'd prefer not to add > another. However, I will add a comment about the issue, which can warn > people. Ok, that's probably good enough then. Thanks! Arnd
diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c index 4ce3ca6f4b79..f2bf3bf93aa6 100644 --- a/drivers/scsi/lpfc/lpfc_sli.c +++ b/drivers/scsi/lpfc/lpfc_sli.c @@ -141,8 +141,13 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe) if (q->dpp_enable && q->phba->cfg_enable_dpp) { /* write to DPP aperture taking advatage of Combined Writes */ tmp = (uint8_t *)wqe; +#ifdef CONFIG_64BIT for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i); +#else + for (i = 0; i < q->entry_size; i += sizeof(uint32_t)) + writel(*((uint32_t *)(tmp + i)), q->dpp_regaddr + i); +#endif } /* ensure WQE bcopy and DPP flushed before doorbell write */ wmb();