lpfc: correct writeq failures on 32-bit architectures

Message ID	20180224223018.9731-1-jsmart2021@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: James Smart <jsmart2021@gmail.com> To: linux-scsi@vger.kernel.org, martin.petersen@oracle.com, jthumshirn@suse.de, hare@suse.com, jejb@linux.vnet.ibm.com, arnd@arndb.de Cc: James Smart <jsmart2021@gmail.com>, Dick Kennedy <dick.kennedy@broadcom.com>, James Smart <james.smart@broadcom.com> Subject: [PATCH] lpfc: correct writeq failures on 32-bit architectures Date: Sat, 24 Feb 2018 14:30:18 -0800 Message-Id: <20180224223018.9731-1-jsmart2021@gmail.com> Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk

James Smart Feb. 24, 2018, 10:30 p.m. UTC

writeq() is not present on all 32-bit architectures.

When 32-bit, use writel()

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
---
 drivers/scsi/lpfc/lpfc_sli.c | 5 +++++
 1 file changed, 5 insertions(+)

Johannes Thumshirn Feb. 26, 2018, 7:56 a.m. UTC | #1

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Arnd Bergmann Feb. 26, 2018, 8:36 a.m. UTC | #2

On Sat, Feb 24, 2018 at 11:30 PM, James Smart <jsmart2021@gmail.com> wrote:
> writeq() is not present on all 32-bit architectures.
>
> When 32-bit, use writel()
>
> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
> Signed-off-by: James Smart <james.smart@broadcom.com>
> ---
>  drivers/scsi/lpfc/lpfc_sli.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
> index 4ce3ca6f4b79..f2bf3bf93aa6 100644
> --- a/drivers/scsi/lpfc/lpfc_sli.c
> +++ b/drivers/scsi/lpfc/lpfc_sli.c
> @@ -141,8 +141,13 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>         if (q->dpp_enable && q->phba->cfg_enable_dpp) {
>                 /* write to DPP aperture taking advatage of Combined Writes */
>                 tmp = (uint8_t *)wqe;
> +#ifdef CONFIG_64BIT
>                 for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
>                         writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
> +#else
> +               for (i = 0; i < q->entry_size; i += sizeof(uint32_t))
> +                       writel(*((uint32_t *)(tmp + i)), q->dpp_regaddr + i);
> +#endif
>         }

Unfortunately, this is still broken on all big-endian architectures. You could
use __raw_writeq() here to fix it, or change the if() clause at the beginning
to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that.

        Arnd

James Smart Feb. 26, 2018, 5:01 p.m. UTC | #3

On 2/26/2018 12:36 AM, Arnd Bergmann wrote:
> Unfortunately, this is still broken on all big-endian architectures. You could
> use __raw_writeq() here to fix it, or change the if() clause at the beginning
> to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that.
> 
>          Arnd

Please explain.  There doesn't seem to be a definitive guide on which of 
the several routines for bar mapping and the several routines for memory 
copying should be used.  We've gone with what we tested and measured.

Also, this code is disabled on everything but X86 right now - so the 
patch should be fine. We settled on the combination which our testing 
showed the best results. We tried other architectures, but there wasn't 
a combination that showed great results. If we figure a combination out, 
we will update the code.

Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall 
performance as it seems to make things cacheable and the speculation is 
the caching aspects delayed the results just enough to make the overall 
gain go down.

-- james

Arnd Bergmann Feb. 26, 2018, 8:04 p.m. UTC | #4

On Mon, Feb 26, 2018 at 6:01 PM, James Smart <jsmart2021@gmail.com> wrote:
>
>
> On 2/26/2018 12:36 AM, Arnd Bergmann wrote:
>>
>> Unfortunately, this is still broken on all big-endian architectures. You
>> could
>> use __raw_writeq() here to fix it, or change the if() clause at the
>> beginning
>> to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that.
>>
>>          Arnd
>
>
> Please explain.  There doesn't seem to be a definitive guide on which of the
> several routines for bar mapping and the several routines for memory copying
> should be used.  We've gone with what we tested and measured.
>
> Also, this code is disabled on everything but X86 right now - so the patch
> should be fine. We settled on the combination which our testing showed the
> best results. We tried other architectures, but there wasn't a combination
> that showed great results. If we figure a combination out, we will update
> the code.

For the endianess, the key to understanding this is that readl/writel and
readq/writeq follow the convention of accessing data as little-endian because
that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit
register value that gets copied between a register and the CPU, and it's
up to the device manufacturer to implement that convention in hardware.
The Linux implementation of writeq() on all architectures therefore puts the
LSB into the first byte of the output.

In contrast, accessing memory using a pointer dereference is defined by
the CPU manufacturer, so in a big-endian CPU the first byte corresponds
to the MSB of the CPU register.

In your loop

                for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
                        writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);

that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr,
which makes no sense as the data in tmp is (presumably) just a stream of
bytes that we want to end up on a disk in the same order, regardless
of which mode the CPU happens to be in at the time we copy the data.

> Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall
> performance as it seems to make things cacheable and the speculation is the
> caching aspects delayed the results just enough to make the overall gain go
> down.

ioremap_wc should never make the access cacheable, but architectures
can fall back to a regular uncached mapping without write-combining when
there is no hardware support for write-combining.

However, when you use writeq() on powerpc, you get a very expensive
barriers ("sync") before each 8 byte chunk, which purposely defeats the
write-combining, plus you might get an indirect function call, depending on
the kernel configuration and the hardware you are running on.

Using __raw_writeq() should get both the endianess right and avoid all
those barriers you really don't want for what is essentially a memcpy().

       Arnd

James Smart Feb. 26, 2018, 9:41 p.m. UTC | #5

On 2/26/2018 12:04 PM, Arnd Bergmann wrote:
> For the endianess, the key to understanding this is that readl/writel and
> readq/writeq follow the convention of accessing data as little-endian because
> that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit
> register value that gets copied between a register and the CPU, and it's
> up to the device manufacturer to implement that convention in hardware.
> The Linux implementation of writeq() on all architectures therefore puts the
> LSB into the first byte of the output.

ok, so I interpret this as readx/writex() get a cpu-endian-based value, 
and the routine compensates for cpu endianness as it puts it out on the 
io bus (LSB=1st byte). All well and good.

> In contrast, accessing memory using a pointer dereference is defined by
> the CPU manufacturer, so in a big-endian CPU the first byte corresponds
> to the MSB of the CPU register.
> 
> In your loop
> 
>                  for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
>                          writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
> 
> that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr,
> which makes no sense as the data in tmp is (presumably) just a stream of
> bytes that we want to end up on a disk in the same order, regardless
> of which mode the CPU happens to be in at the time we copy the data.

This is where I get a bit lost.

So what you're telling me is:
   with
     uint64_t a=5, *b;
     *b=5;
   that
     writeq(a, regptr);
   is not the same as
     writeq(*b, regptr);
   ??

That doesn't make sense to me as "*ptr" should have resulted in a 
cpu-endian value being given to writeq(), ignorant of how that value 
came into existence (pointer/constant/value in register). Then writeq() 
does it's thing to ensure LSB=1st byte

It would only make sense if writeq() can be a macro that such that it 
can resolve into something that is essentially
   *regport = *b;
thus it's treated as a bytestream starting at address b and does not 
necessarily compensate for the cpu-endianness of the multi-byte quantity.

-- james

Arnd Bergmann Feb. 27, 2018, 8:58 a.m. UTC | #6

On Mon, Feb 26, 2018 at 10:41 PM, James Smart <jsmart2021@gmail.com> wrote:
> On 2/26/2018 12:04 PM, Arnd Bergmann wrote:
>>
>> For the endianess, the key to understanding this is that readl/writel and
>> readq/writeq follow the convention of accessing data as little-endian
>> because
>> that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit
>> register value that gets copied between a register and the CPU, and it's
>> up to the device manufacturer to implement that convention in hardware.
>> The Linux implementation of writeq() on all architectures therefore puts
>> the
>> LSB into the first byte of the output.
>
>
> ok, so I interpret this as readx/writex() get a cpu-endian-based value, and
> the routine compensates for cpu endianness as it puts it out on the io bus
> (LSB=1st byte). All well and good.

it helps to think of it as compensating for device endianess rather than
compensating for CPU endianess. The effect is the same of course.

>> In contrast, accessing memory using a pointer dereference is defined by
>> the CPU manufacturer, so in a big-endian CPU the first byte corresponds
>> to the MSB of the CPU register.
>>
>> In your loop
>>
>>                  for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
>>                          writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr +
>> i);
>>
>> that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr,
>> which makes no sense as the data in tmp is (presumably) just a stream of
>> bytes that we want to end up on a disk in the same order, regardless
>> of which mode the CPU happens to be in at the time we copy the data.
>
>
> This is where I get a bit lost.
>
> So what you're telling me is:
>   with
>     uint64_t a=5, *b;
>     *b=5;
>   that
>     writeq(a, regptr);
>   is not the same as
>     writeq(*b, regptr);
>   ??
>
> That doesn't make sense to me as "*ptr" should have resulted in a cpu-endian
> value being given to writeq(), ignorant of how that value came into
> existence (pointer/constant/value in register). Then writeq() does it's
> thing to ensure LSB=1st byte
>
> It would only make sense if writeq() can be a macro that such that it can
> resolve into something that is essentially
>   *regport = *b;
> thus it's treated as a bytestream starting at address b and does not
> necessarily compensate for the cpu-endianness of the multi-byte quantity.

What you are describing above is not a byte stream but dealing with
a 64-bit integer. In both cases you obviously end up with the destination
data being 0x05 0x00 0x00 0x00 0x00 0x00 0x00 0x00, there is no
difference.

The case that you have in the driver is more like

      char *bytestream = "this is a byte stream";
      u64 *tmp = (u64 *)bytestream;
      writeq(*tmp, regptr);
      writeq(*(tmp + 1), regptr + 1);

On a big-endian cpu, *tmp is the 64-bit number 0x7468697320697320,
and *(tmp +1) is 0x6120627974652073, as the access to the memory
location is performed using big-endian semantics.

When we write that to the PCI bus using little-endian semantics, the
buffer ends up being  0x20 0x73 0x69 0x20 0x73 0x69 0x68 0x74
0x20 0x73 0x20 0x65 0x74 0x79 0x62 0x20, or " si siht s etyb a".

        Arnd

James Smart Feb. 27, 2018, 6:05 p.m. UTC | #7

On 2/27/2018 12:58 AM, Arnd Bergmann wrote:
> What you are describing above is not a byte stream but dealing with
> a 64-bit integer. In both cases you obviously end up with the destination
> data being 0x05 0x00 0x00 0x00 0x00 0x00 0x00 0x00, there is no
> difference.
>
> The case that you have in the driver is more like
>
>        char *bytestream = "this is a byte stream";
>        u64 *tmp = (u64 *)bytestream;
>        writeq(*tmp, regptr);
>        writeq(*(tmp + 1), regptr + 1);
>
> On a big-endian cpu, *tmp is the 64-bit number 0x7468697320697320,
> and *(tmp +1) is 0x6120627974652073, as the access to the memory
> location is performed using big-endian semantics.
>
> When we write that to the PCI bus using little-endian semantics, the
> buffer ends up being  0x20 0x73 0x69 0x20 0x73 0x69 0x68 0x74
> 0x20 0x73 0x20 0x65 0x74 0x79 0x62 0x20, or " si siht s etyb a".

ok - I get what you are saying...

The summary is:  the writeX() routines take a cpu-endianness specific 
int value and send it to the device in a byte order corresponding to LSB 
to MSB.  On LE cpu's using writex() to PCI (LE bus), LSB to MSB mirrors 
byte order in memory.  But on BE cpu's to PCI, LSB to MSB means the byte 
order was effectively byteswapped as it is sent to the device.

So you point out a very real concern, as in most cases the source buffer 
is a bytestream and the desire is to send the bytestream in the same 
byte order as in memory.  It turns out we're somewhat lucky as the 
driver's source buffer wasn't a real bytestream, it was a cpu-endian 
relative structure so writex behavior would be ok. However, the driver 
is confused as it wants to use 64-bit copies, but the cpu-endian 
structure was based on 32-bit's at a time when mapping to the hardware. 
So things are ok if using writel, but not so ok if writeq.

As the code stands write now - this feature and the routine is only used 
when x86, thus LE, thus the whole LSB conversion by writeX works.  It's 
a mute point.

Please accept the patch as proposed. We will address these other issues 
as we include the "push" support for other architectures.

-- james

Arnd Bergmann Feb. 27, 2018, 8:15 p.m. UTC | #8

On Tue, Feb 27, 2018 at 7:05 PM, James Smart <james.smart@broadcom.com> wrote:
> On 2/27/2018 12:58 AM, Arnd Bergmann wrote:
>
> So you point out a very real concern, as in most cases the source buffer is
> a bytestream and the desire is to send the bytestream in the same byte order
> as in memory.  It turns out we're somewhat lucky as the driver's source
> buffer wasn't a real bytestream, it was a cpu-endian relative structure so
> writex behavior would be ok. However, the driver is confused as it wants to
> use 64-bit copies, but the cpu-endian structure was based on 32-bit's at a
> time when mapping to the hardware. So things are ok if using writel, but not
> so ok if writeq.

Ah, so it is a structure of only 32-bit integers? Obviously if you have 16-bit
or 64-bit integers mixed in with the structure, things get even more
complicated.

For an array of 32-bit little-endian registers, this variant should work
on all architectures:

#include <linux/io-64-nonatomic-lo-hi.h>

         for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
                   writeq_relaxed(*((uint32_t *)(tmp + i) |
                                           *((uint32_t *)(tmp + i + 1) << 32,
                                           (q->dpp_regaddr + i)

On x86_64 and x86_32, that should do the exact same thing as
your version, but it should also work efficiently on other architectures.
The __raw_writeq() version I suggested earlier is not correct
if we are dealing with an array of __le32 values in hardware.

On  powerpc, the writeq_relaxed() unfortunately still prevents
write-combining, so that won't be efficient.

> As the code stands write now - this feature and the routine is only used
> when x86, thus LE, thus the whole LSB conversion by writeX works.  It's a
> mute point.
>
> Please accept the patch as proposed. We will address these other issues as
> we include the "push" support for other architectures.

Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)'
block then to make sure that if anybody tries to make it work on other
architectures, they are aware of the problem?

      Arnd

James Smart Feb. 27, 2018, 8:24 p.m. UTC | #9

On 2/27/2018 12:15 PM, Arnd Bergmann wrote:
> Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)'
> block then to make sure that if anybody tries to make it work on other
> architectures, they are aware of the problem?

the ifdef is around the area where wc is enabled. I'd prefer not to add 
another. However, I will add a comment about the issue, which can warn 
people.

-- james

Arnd Bergmann Feb. 27, 2018, 8:28 p.m. UTC | #10

On Tue, Feb 27, 2018 at 9:24 PM, James Smart <jsmart2021@gmail.com> wrote:
> On 2/27/2018 12:15 PM, Arnd Bergmann wrote:
>>
>> Could you add an #ifdef and comment around the 'if (q->dpp_enable ...)'
>> block then to make sure that if anybody tries to make it work on other
>> architectures, they are aware of the problem?
>
>
> the ifdef is around the area where wc is enabled. I'd prefer not to add
> another. However, I will add a comment about the issue, which can warn
> people.

Ok, that's probably good enough then. Thanks!

      Arnd

lpfc: correct writeq failures on 32-bit architectures

Commit Message

Comments

Patch