
Revert "arm64: Increase the max granular size"

Message ID 1458120743-12145-1-git-send-email-opensource.ganesh@gmail.com (mailing list archive)
State New, archived

Commit Message

Mahendran Ganesh March 16, 2016, 9:32 a.m. UTC
Reverts commit 97303480753e ("arm64: Increase the max granular size").

Commit 97303480753e ("arm64: Increase the max granular size") degrades
system performance on some CPUs.

We tested WiFi network throughput with iperf on a Qualcomm msm8996 CPU:
----------------
run on host:
  # iperf -s
run on device:
  # iperf -c <device-ip-addr> -t 100 -i 1
----------------

Test result:
----------------
with commit 97303480753e ("arm64: Increase the max granular size"):
    172MBits/sec

without commit 97303480753e ("arm64: Increase the max granular size"):
    230MBits/sec
----------------

Some modules, such as slab and net, use L1_CACHE_SHIFT, so if we do not
set this parameter correctly, it may affect system performance.

So revert the commit.

Cc: stable@vger.kernel.org
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
---
 arch/arm64/include/asm/cache.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Will Deacon March 16, 2016, 10:07 a.m. UTC | #1
[adding Cavium folk and Timur]

On Wed, Mar 16, 2016 at 05:32:23PM +0800, Ganesh Mahendran wrote:
> Reverts commit 97303480753e ("arm64: Increase the max granular size").
> 
> The commit 97303480753e ("arm64: Increase the max granular size") will
> degrade system performente in some cpus.
> 
> We test wifi network throughput with iperf on Qualcomm msm8996 CPU:
> ----------------
> run on host:
>   # iperf -s
> run on device:
>   # iperf -c <device-ip-addr> -t 100 -i 1
> ----------------
> 
> Test result:
> ----------------
> with commit 97303480753e ("arm64: Increase the max granular size"):
>     172MBits/sec
> 
> without commit 97303480753e ("arm64: Increase the max granular size"):
>     230MBits/sec
> ----------------
> 
> Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not
> set the parameter correctly, it may affect the system performance.
> 
> So revert the commit.

Unfortunately, the original patch is required to support the 128-byte L1
cache lines of Cavium ThunderX, so we can't simply revert it like this.
Similarly, the desire for a single, multiplatform kernel image prevents
us from reasonably fixing this at compile time to anything other than
the expected maximum value.

Furthermore, Timur previously said that the change is also required
"on our [Qualcomm] silicon", but I'm not sure if this is msm9886 or not:

http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

You could look into making ARCH_DMA_MINALIGN a runtime value, but that
looks like an uphill struggle to me. Alternatively, we could only warn
if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
DMA master, but that doesn't solve any performance issues from having
things like locks sharing cachelines, not that I think we ever got any
data on that (afaik, we don't pad locks to cacheline boundaries anyway).
I'm also not sure what it would mean for PCI NoSnoop transactions.
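
A minimal sketch of such a check, assuming a raw CTR_EL0 value is to hand (the helper names below are made up, not existing kernel API):
------
#include <linux/cache.h>
#include <linux/printk.h>
#include <linux/types.h>

/*
 * Sketch only: decode the Cache Writeback Granule from CTR_EL0.
 * CTR_EL0[27:24] is log2(CWG) in 4-byte words; 0 means "not reported".
 */
static unsigned int sketch_cwg_bytes(u64 ctr_el0)
{
	unsigned int cwg = (ctr_el0 >> 24) & 0xf;

	return cwg ? 4U << cwg : L1_CACHE_BYTES;
}

/* Warn only when it can matter: CWG too big *and* a non-coherent master. */
static void sketch_check_cwg(u64 ctr_el0, bool have_noncoherent_dma)
{
	if (have_noncoherent_dma && sketch_cwg_bytes(ctr_el0) > L1_CACHE_BYTES)
		pr_warn("CWG (%u bytes) exceeds L1_CACHE_BYTES (%u)\n",
			sketch_cwg_bytes(ctr_el0),
			(unsigned int)L1_CACHE_BYTES);
}
------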

Will
Timur Tabi March 16, 2016, 1:06 p.m. UTC | #2
Will Deacon wrote:
> [adding Cavium folk and Timur]
>
> On Wed, Mar 16, 2016 at 05:32:23PM +0800, Ganesh Mahendran wrote:
>> Reverts commit 97303480753e ("arm64: Increase the max granular size").
>>
>> The commit 97303480753e ("arm64: Increase the max granular size") will
>> degrade system performente in some cpus.
>>
>> We test wifi network throughput with iperf on Qualcomm msm8996 CPU:
>> ----------------
>> run on host:
>>    # iperf -s
>> run on device:
>>    # iperf -c <device-ip-addr> -t 100 -i 1
>> ----------------
>>
>> Test result:
>> ----------------
>> with commit 97303480753e ("arm64: Increase the max granular size"):
>>      172MBits/sec
>>
>> without commit 97303480753e ("arm64: Increase the max granular size"):
>>      230MBits/sec
>> ----------------
>>
>> Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not
>> set the parameter correctly, it may affect the system performance.
>>
>> So revert the commit.
>
> Unfortunately, the original patch is required to support the 128-byte L1
> cache lines of Cavium ThunderX, so we can't simply revert it like this.
> Similarly, the desire for a single, multiplatform kernel image prevents
> us from reasonably fixing this at compile time to anything other than
> the expected maximum value.
>
> Furthermore, Timur previously said that the change is also required
> "on our [Qualcomm] silicon", but I'm not sure if this is msm9886 or not:
>
> http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

I was talking about our server part, the QDF2432.  At the time, I wasn't 
allowed to mention it by name.

> You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> looks like an uphill struggle to me. Alternatively, we could only warn
> if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> DMA master, but that doesn't solve any performance issues from having
> things like locks sharing cachelines, not that I think we ever got any
> data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> I'm also not sure what it would mean for PCI NoSnoop transactions.

Our internal version of this patch made it a Kconfig option.  Perhaps 
that would at least be an improvement over just reverting it?  We 
already have to have our own defconfig for the QDF2432.
Mark Rutland March 16, 2016, 2:03 p.m. UTC | #3
On Wed, Mar 16, 2016 at 08:06:22AM -0500, Timur Tabi wrote:
> Will Deacon wrote:
> >Unfortunately, the original patch is required to support the 128-byte L1
> >cache lines of Cavium ThunderX, so we can't simply revert it like this.
> >Similarly, the desire for a single, multiplatform kernel image prevents
> >us from reasonably fixing this at compile time to anything other than
> >the expected maximum value.
> >
> >Furthermore, Timur previously said that the change is also required
> >"on our [Qualcomm] silicon", but I'm not sure if this is msm9886 or not:
> >
> >http://lkml.kernel.org/r/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com
> 
> I was talking about our server part, the QDF2432.  At the time, I
> wasn't allowed to mention it by name.
> 
> >You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> >looks like an uphill struggle to me. Alternatively, we could only warn
> >if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> >DMA master, but that doesn't solve any performance issues from having
> >things like locks sharing cachelines, not that I think we ever got any
> >data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> >I'm also not sure what it would mean for PCI NoSnoop transactions.
> 
> Our internal version of this patch made it a Kconfig option.
> Perhaps that would at least be an improvement over just reverting
> it?  We already have to have our own defconfig for the QDF2432.

While having an option for producing a less-portable, performance-tuned kernel
might not be the end of the world, the defconfig is intended to function
correctly on all platforms (assuming LE and 4K page support).

Even if we were to add the option, the default would have to be the maximum
size known to be implemented.

If I understand correctly, the main reason that we need this for correctness is
non-coherent DMA to/from SLAB caches.

A more general approach (and more invasive, but perhaps less so than making
ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
force the use of bounce buffers (which could be padded to the architectural
maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
the case of coherent DMA.
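
A rough sketch of that decision point (hypothetical helper; the real change would have to hook into the DMA-mapping setup and bounce-buffer machinery):
------
#include <linux/cache.h>
#include <linux/types.h>

/*
 * Sketch only: bounce-buffer a device's DMA when it is non-coherent and
 * the runtime CWG exceeds the compile-time ARCH_DMA_MINALIGN that slab
 * allocations were padded to. The bounce buffers themselves could be
 * padded to the architectural maximum CWG of 2K.
 */
static bool sketch_needs_bounce(unsigned int cwg_bytes, bool coherent)
{
	if (coherent)
		return false;	/* coherent DMA: alignment doesn't matter */

	return cwg_bytes > ARCH_DMA_MINALIGN;
}
------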

I would consider NoSnoop a separate case. It's closer to "negatively coherent",
and always requires a page-aligned buffer anyway due to MMU behaviour.

Thanks,
Mark.
Catalin Marinas March 16, 2016, 2:18 p.m. UTC | #4
On Wed, Mar 16, 2016 at 08:06:22AM -0500, Timur Tabi wrote:
> Will Deacon wrote:
> >You could look into making ARCH_DMA_MINALIGN a runtime value, but that
> >looks like an uphill struggle to me. Alternatively, we could only warn
> >if the CWG is bigger than L1_CACHE_BYTES *and* we have a non-coherent
> >DMA master, but that doesn't solve any performance issues from having
> >things like locks sharing cachelines, not that I think we ever got any
> >data on that (afaik, we don't pad locks to cacheline boundaries anyway).
> >I'm also not sure what it would mean for PCI NoSnoop transactions.
> 
> Our internal version of this patch made it a Kconfig option.  Perhaps that
> would at least be an improvement over just reverting it?  We already have to
> have our own defconfig for the QDF2432.

Why do you need your own defconfig? If it's just for the short term until
all your code is upstream, that's fine, but this goes against the single
Image aim. I would like defconfig to cover all supported SoCs (and yes,
ACPI on by default once we no longer deem it EXPERT), though at some
point we may need a server/mobile split (if the generated image is too
large, maybe more stuff being built as modules).
Will Deacon March 16, 2016, 2:35 p.m. UTC | #5
On Wed, Mar 16, 2016 at 02:03:35PM +0000, Mark Rutland wrote:
> If I understand correctly, the main reason that we need this for correctness is
> non-coherent DMA to/from SLAB caches.
> 
> A more general approach (and more invasive, but perhaps less so than making
> ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
> whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
> force the use of bounce buffers (which could be padded to the architectural
> maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
> the case of coherent DMA.
>
> I would consider NoSnoop a separate case. It's closer to "negatively coherent",
> and always required page-aligned buffer anyway due to MMU behaviour.

What makes you say that? There are no such alignment requirements for
buffers that may be accessed with a NoSnoop transaction. On ARM, we'll
have a mismatched alias, but we'd need to solve that with explicit
cache maintenance (and my understanding is that's what things like GPU
drivers already do on x86).

Will
Mark Rutland March 16, 2016, 2:54 p.m. UTC | #6
On Wed, Mar 16, 2016 at 02:35:35PM +0000, Will Deacon wrote:
> On Wed, Mar 16, 2016 at 02:03:35PM +0000, Mark Rutland wrote:
> > If I understand correctly, the main reason that we need this for correctness is
> > non-coherent DMA to/from SLAB caches.
> > 
> > A more general approach (and more invasive, but perhaps less so than making
> > ARCH_DMA_MINALIGN usage completely dynamic) would be to determine at runtime
> > whether the CWG is larger than the configured ARCH_DMA_MINALIGN, and if so,
> > force the use of bounce buffers (which could be padded to the architectural
> > maximum of 2K) for non-coherent DMA. That nicely degrades to not mattering for
> > the case of coherent DMA.
> >
> > I would consider NoSnoop a separate case. It's closer to "negatively coherent",
> > and always required page-aligned buffer anyway due to MMU behaviour.
> 
> What makes you say that? There are no such alignment requirements for
> buffers that may be accessed with a NoSnoop transaction. On ARM, we'll
> have a mismatched alias, but we'd need to solve that with explicit
> cache maintenance (and my understanding is that's what things like GPU
> drivers already do on x86).

I was under the impression that NoSnoop transactions were permitted to be
Cacheable, even if non-snooping (e.g. allowing them to allocate and hit in a
system cache).

If that is permitted, then data corruption could potentially occur in the
presence of another cacheable alias due to things like line migration (e.g. a
CPU making a speculative fetch and taking ownership of a line that was in the
system cache). To avoid that, you'd have to remove any cacheable alias, for
which we only have page-granular control.

If that is not permitted, then no-snoop is effectively non-cacheable and
non-coherent, and my comment doesn't hold.

Thanks,
Mark.
Timur Tabi March 16, 2016, 3:26 p.m. UTC | #7
Catalin Marinas wrote:
> Why do you need your own defconfig? If it's just on the short term until
> all your code is upstream, that's fine, but this goes against the single
> Image aim. I would like defconfig to cover all supported SoCs (and yes,
> ACPI on by default once we deem it !EXPERT anymore), though at some
> point we may need a server/mobile split (if the generated image is too
> large, maybe more stuff being built as modules).

Yes, that's exactly it.  Ours is an ACPI system, and so we have to have 
our own defconfig for now.  We're holding off on pushing our own 
defconfig changes (enabling drivers, etc) until ACPI is enabled in 
arch/arm64/configs/defconfig.

My understanding is that ACPI won't be enabled by default until at least 
after the GIC driver is fully ACPI-enabled.
Catalin Marinas March 17, 2016, 2:27 p.m. UTC | #8
On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
> Catalin Marinas wrote:
> >Why do you need your own defconfig? If it's just on the short term until
> >all your code is upstream, that's fine, but this goes against the single
> >Image aim. I would like defconfig to cover all supported SoCs (and yes,
> >ACPI on by default once we deem it !EXPERT anymore), though at some
> >point we may need a server/mobile split (if the generated image is too
> >large, maybe more stuff being built as modules).
> 
> Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> own defconfig for now.  We're holding off on pushing our own defconfig
> changes (enabling drivers, etc) until ACPI is enabled in
> arch/arm64/configs/defconfig.

Is there anything that prevents you from providing a dtb/dts for this
SoC?
Timur Tabi March 17, 2016, 2:49 p.m. UTC | #9
Catalin Marinas wrote:
>> >Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
>> >own defconfig for now.  We're holding off on pushing our own defconfig
>> >changes (enabling drivers, etc) until ACPI is enabled in
>> >arch/arm64/configs/defconfig.

> Is there anything that prevents you from providing a dtb/dts for this
> SoC?

We don't have a boot loader capable of passing a device tree to the 
kernel.  It's an ARM Server chip.  It doesn't do device tree.  It's 100% 
ACPI.  We boot with UEFI that configures the system and generates ACPI 
tables.

I just want to make this crystal clear, because it comes up every now 
and then.  The QDF2432 is an ACPI-only SOC with no device tree support 
whatsoever.  Zero.  Zip.  Nada.  It's not an option.

Keep in mind that on an ACPI system like ours, the boot loader (UEFI in 
our case) configures the system extensively.  It does a lot of things 
that the kernel would normally do on a device tree system.  For example, 
pin control is handled completely by UEFI.  The kernel does not set the 
pin muxes or GPIO directions.  That means we don't support dynamic pin 
muxing.  Before the kernel is booted, the GPIO pins are fixed.

We're not going to create an entire device tree from scratch for this 
chip, and then make the extensive modifications necessary to our boot 
loader for parsing and modifying that device tree.  That would take 
months of work, and it would be all throw-away code.
Catalin Marinas March 17, 2016, 3:37 p.m. UTC | #10
On Thu, Mar 17, 2016 at 09:49:51AM -0500, Timur Tabi wrote:
> Catalin Marinas wrote:
> >>>Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> >>>own defconfig for now.  We're holding off on pushing our own defconfig
> >>>changes (enabling drivers, etc) until ACPI is enabled in
> >>>arch/arm64/configs/defconfig.
> 
> >Is there anything that prevents you from providing a dtb/dts for this
> >SoC?
> 
> We don't have a boot loader capable of passing a device tree to the kernel.
> It's an ARM Server chip.  It doesn't do device tree.  It's 100% ACPI.  We
> boot with UEFI that configures the system and generates ACPI tables.

Well, I disagree with the idea that server == ACPI. But I guess you knew
this already.

> I just want to make this crystal clear, because it comes up every now and
> then.  The QDF2432 is an ACPI-only SOC with no device tree support
> whatsoever.  Zero.  Zip.  Nada.  It's not an option.

That's your choice really, I don't care much (as long as current ACPI
support has all the features you need; if it doesn't, there is a good
chance that your SoC won't be supported in mainline until it's sorted).

> Keep in mind that on an ACPI system like ours, the boot loader (UEFI in our
> case) configures the system extensively.  It does a lot of things that the
> kernel would normally do on a device tree system.  For example, pin control
> is handled completely by UEFI.  The kernel does not set the pin muxes or
> GPIO directions.  That means we don't support dynamic pin muxing.  Before
> the kernel is booted, the GPIO pins are fixed.

And that's great. But you are mistaken in thinking that DT requires lots
of drivers in the kernel and prevents the firmware from doing sane
stuff. DT rather gained additional features out of necessity since the
firmware was not always doing a proper job at hardware initialisation.
A DT-enabled kernel does not impose restrictions on such firmware
features. With ACPI, the choice is not as wide and forces vendors to
look into their firmware story from a different angle (until they figure
the _DSD+PRP0001 out and we end up with DT emulated in ACPI).

If the GPIO pins are fixed at boot-time, they don't even need to be
described in the DT, just let the firmware configure them. However, if
they need to be changed at run-time (which does not seem to be your
case), that's where you have a choice of either kernel driver (DT) or
AML code (ACPI) (or both). Otherwise, without this run-time aspect, both
DT and ACPI are just slightly different ways to provide a static
platform description.

> We're not going to create an entire device tree from scratch for this chip,
> and then make the extensive modifications necessary to our boot loader for
> parsing and modifying that device tree.  That would take months of work, and
> it would be all throw-away code.

As I said above, that's your choice and it depends on your timeline and
mainline support requirements. I however disagree with this being
"throw-away code" (or even taking "months of work").
Marc Zyngier March 17, 2016, 4:03 p.m. UTC | #11
On 17/03/16 15:37, Catalin Marinas wrote:
> On Thu, Mar 17, 2016 at 09:49:51AM -0500, Timur Tabi wrote:

[...]

>> Keep in mind that on an ACPI system like ours, the boot loader (UEFI in our
>> case) configures the system extensively.  It does a lot of things that the
>> kernel would normally do on a device tree system.  For example, pin control
>> is handled completely by UEFI.  The kernel does not set the pin muxes or
>> GPIO directions.  That means we don't support dynamic pin muxing.  Before
>> the kernel is booted, the GPIO pins are fixed.
> 
> And that's great. But you are mistaken in thinking that DT requires lots
> of drivers in the kernel and prevents the firmware from doing sane
> stuff. DT rather gained additional features out of necessity since the
> firmware was not always doing a proper job at hardware initialisation.
> A DT-enabled kernel does not impose restrictions on such firmware
> features. With ACPI, the choice is not as wide and forces vendors to
> look into their firmware story from a different angle (until they figure
> the _DSD+PRP0001 out and we end up with DT emulated in ACPI).

Or even worse, perverting the couple of things that were actually OK in
ACPI by inventing a new layer of broken stuff:

http://thread.gmane.org/gmane.linux.ports.arm.msm/17828

Once we've reached that level, _DSD feels all nice and cuddly.

	M.
Andrew Pinski March 17, 2016, 6:07 p.m. UTC | #12
On 3/17/2016 7:27 AM, Catalin Marinas wrote:
> On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
>> Catalin Marinas wrote:
>>> Why do you need your own defconfig? If it's just on the short term until
>>> all your code is upstream, that's fine, but this goes against the single
>>> Image aim. I would like defconfig to cover all supported SoCs (and yes,
>>> ACPI on by default once we deem it !EXPERT anymore), though at some
>>> point we may need a server/mobile split (if the generated image is too
>>> large, maybe more stuff being built as modules).
>> Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
>> own defconfig for now.  We're holding off on pushing our own defconfig
>> changes (enabling drivers, etc) until ACPI is enabled in
>> arch/arm64/configs/defconfig.
> Is there anything that prevents you from providing a dtb/dts for this
> SoC?

Note that ThunderX SoCs have customers that include both embedded users
(U-Boot) and server users (UEFI).  The cores always have a 128-byte
cacheline size.  So please don't make this dependent on ACPI.  Note that
the defconfig works correctly on T88.

Thanks,
Andrew


>
Timur Tabi March 17, 2016, 6:34 p.m. UTC | #13
Andrew Pinski wrote:
>>
>
> Note ThunderX's SOC have customers  where some are embedded users
> (uboot) and server users (UEFI).  The cores always have 128 byte
> cacheline size.  So please don't make this dependent on ACPI.  Note the
> defconfig works correctly on T88.

This thread is getting off-topic.  There's nothing about the cacheline 
size that is dependent on ACPI or DT.  Catalin was wondering why we have 
our own defconfig for our ARM64 SOC, and I replied that it's because 
ACPI is not enabled yet in the upstream defconfig.
Catalin Marinas March 17, 2016, 6:37 p.m. UTC | #14
On Thu, Mar 17, 2016 at 11:07:00AM -0700, Andrew Pinski wrote:
> On 3/17/2016 7:27 AM, Catalin Marinas wrote:
> >On Wed, Mar 16, 2016 at 10:26:08AM -0500, Timur Tabi wrote:
> >>Catalin Marinas wrote:
> >>>Why do you need your own defconfig? If it's just on the short term until
> >>>all your code is upstream, that's fine, but this goes against the single
> >>>Image aim. I would like defconfig to cover all supported SoCs (and yes,
> >>>ACPI on by default once we deem it !EXPERT anymore), though at some
> >>>point we may need a server/mobile split (if the generated image is too
> >>>large, maybe more stuff being built as modules).
> >>Yes, that's exactly it.  Ours is an ACPI system, and so we have to have our
> >>own defconfig for now.  We're holding off on pushing our own defconfig
> >>changes (enabling drivers, etc) until ACPI is enabled in
> >>arch/arm64/configs/defconfig.
> >Is there anything that prevents you from providing a dtb/dts for this
> >SoC?
> 
> Note ThunderX's SOC have customers  where some are embedded users (uboot)
> and server users (UEFI).  The cores always have 128 byte cacheline size.  So
> please don't make this dependent on ACPI. 

Definitely not, this has nothing to do with ACPI or servers. My comment
on different defconfig was more about things like 64K pages vs 4K, if
the former ever prove useful in practice. Who knows, we may even see
ACPI for IoT ;) (with MS involvement in Raspberry Pi)

> Note the defconfig works correctly on T88.

We have two aspects to address: one is correctness and the other is
performance. But we bundle everything under L1_CACHE_BYTES which affects
platforms that don't have such large cache lines (actually, it may even
affect those that do; has anyone done actual benchmarks?)

As Will suggested, we could try to revert L1_CACHE_BYTES back to 64 and
make ARCH_DMA_MINALIGN run-time based on CWG for correctness. Would this
work on the Cavium hardware?
Chalamarla, Tirumalesh March 18, 2016, 9:05 p.m. UTC | #15
On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:

>Reverts commit 97303480753e ("arm64: Increase the max granular size").
>
>The commit 97303480753e ("arm64: Increase the max granular size") will
>degrade system performente in some cpus.
>
>We test wifi network throughput with iperf on Qualcomm msm8996 CPU:
>----------------
>run on host:
>  # iperf -s
>run on device:
>  # iperf -c <device-ip-addr> -t 100 -i 1
>----------------
>
>Test result:
>----------------
>with commit 97303480753e ("arm64: Increase the max granular size"):
>    172MBits/sec
>
>without commit 97303480753e ("arm64: Increase the max granular size"):
>    230MBits/sec
>----------------
>
>Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not
>set the parameter correctly, it may affect the system performance.
>
>So revert the commit.

Is there any explanation of why this is so? Maybe there is an alternative to this, apart from reverting the commit.

Until now it seems L1_CACHE_SHIFT has been the max across the supported chips. But now we are making it 64 bytes; is there any reason why not 32?

Thanks,
Tirumalesh. 
>
>Cc: stable@vger.kernel.org
>Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>---
> arch/arm64/include/asm/cache.h |    2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>index 5082b30..bde4499 100644
>--- a/arch/arm64/include/asm/cache.h
>+++ b/arch/arm64/include/asm/cache.h
>@@ -18,7 +18,7 @@
> 
> #include <asm/cachetype.h>
> 
>-#define L1_CACHE_SHIFT		7
>+#define L1_CACHE_SHIFT		6
> #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> 
> /*
>-- 
>1.7.9.5
>
>
>_______________________________________________
>linux-arm-kernel mailing list
>linux-arm-kernel@lists.infradead.org
>http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Mahendran Ganesh March 21, 2016, 1:56 a.m. UTC | #16
Hello, Tirumalesh:

2016-03-19 5:05 GMT+08:00 Chalamarla, Tirumalesh
<Tirumalesh.Chalamarla@caviumnetworks.com>:
>
>
>
>
>
> On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
>
>>Reverts commit 97303480753e ("arm64: Increase the max granular size").
>>
>>The commit 97303480753e ("arm64: Increase the max granular size") will
>>degrade system performente in some cpus.
>>
>>We test wifi network throughput with iperf on Qualcomm msm8996 CPU:
>>----------------
>>run on host:
>>  # iperf -s
>>run on device:
>>  # iperf -c <device-ip-addr> -t 100 -i 1
>>----------------
>>
>>Test result:
>>----------------
>>with commit 97303480753e ("arm64: Increase the max granular size"):
>>    172MBits/sec
>>
>>without commit 97303480753e ("arm64: Increase the max granular size"):
>>    230MBits/sec
>>----------------
>>
>>Some module like slab/net will use the L1_CACHE_SHIFT, so if we do not
>>set the parameter correctly, it may affect the system performance.
>>
>>So revert the commit.
>
> Is there any explanation why is this so? May be there is an alternative to this, apart from reverting the commit.
>

I just think that commit 97303480753e ("arm64: Increase the max
granular size") introduced a new problem for other SoCs whose L1 cache
line size is not 128 bytes. So I wanted to revert this commit.

> Until now it seems L1_CACHE_SHIFT is the max of supported chips. But now we are making it 64byte, is there any reason why not 32.
>

We cannot simply set L1_CACHE_SHIFT to the max. There are other
places which use the L1 cache line size.
If we just set the L1 cache line size to the max, the memory footprint
and system performance will be affected.
For example:
------
#define SMP_CACHE_BYTES L1_CACHE_BYTES
#define SKB_DATA_ALIGN(X) ALIGN(X, SMP_CACHE_BYTES)
------
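
As a stand-alone illustration of the cost (the length is a hypothetical example; the rounding is the same one ALIGN() does):
------
#include <stdio.h>

/* Same rounding as the kernel's ALIGN()/SKB_DATA_ALIGN(). */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned int len = 130;	/* example skb data length */

	/* 130 rounds to 192 with 64-byte lines, to 256 with 128-byte lines */
	printf("64B lines: %u, 128B lines: %u\n",
	       ALIGN_UP(len, 64u), ALIGN_UP(len, 128u));
	return 0;
}
------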

Thanks.

> Thanks,
> Tirumalesh.
>>
>>Cc: stable@vger.kernel.org
>>Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>>---
>> arch/arm64/include/asm/cache.h |    2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>>diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>index 5082b30..bde4499 100644
>>--- a/arch/arm64/include/asm/cache.h
>>+++ b/arch/arm64/include/asm/cache.h
>>@@ -18,7 +18,7 @@
>>
>> #include <asm/cachetype.h>
>>
>>-#define L1_CACHE_SHIFT                7
>>+#define L1_CACHE_SHIFT                6
>> #define L1_CACHE_BYTES                (1 << L1_CACHE_SHIFT)
>>
>> /*
>>--
>>1.7.9.5
>>
>>
>>_______________________________________________
>>linux-arm-kernel mailing list
>>linux-arm-kernel@lists.infradead.org
>>http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Patch

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30..bde4499 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,7 +18,7 @@ 
 
 #include <asm/cachetype.h>
 
-#define L1_CACHE_SHIFT		7
+#define L1_CACHE_SHIFT		6
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
 /*