mbox series

[v3,0/6] RISC-V: Apply Zicboz to clear_page

Message ID 20230130120128.1349464-1-ajones@ventanamicro.com (mailing list archive)
Headers show
Series RISC-V: Apply Zicboz to clear_page | expand

Message

Andrew Jones Jan. 30, 2023, 12:01 p.m. UTC
When the Zicboz extension is available we can more rapidly zero naturally
aligned Zicboz block sized chunks of memory. As pages are always page
aligned and are larger than any Zicboz block size will be, then
clear_page() appears to be a good candidate for the extension. While cycle
count and energy consumption should also be considered, we can be pretty
certain that implementing clear_page() with the Zicboz extension is a win
by comparing the new dynamic instruction count with its current count[1].
Doing so we see that the new count is just over a quarter of the old count
(see patch4's commit message for more details).

For those of you who reviewed v1[2], you may be looking for the memset()
patches. As pointed out in v1, and a couple follow-up emails, it's not
clear that patching memset() is a win yet. When I get a chance to test
on real hardware with a comprehensive benchmark collection then I can
post the memset() patches separately (assuming the benchmarks show it's
worthwhile).

Dependencies:
  - "[PATCH v5 00/13] riscv: improve boot time isa extensions handling"
    https://lore.kernel.org/all/20230128172856.3814-1-jszhang@kernel.org/
  - "[PATCH v1 0/3] Remove toolchain dependencies for Zicbom"
    https://lore.kernel.org/all/20230108163356.3063839-1-conor@kernel.org/

The patches are also available here
https://github.com/jones-drew/linux/commits/riscv/zicboz-v3

To test over QEMU this branch may be used to enable Zicboz
https://gitlab.com/jones-drew/qemu/-/commits/riscv/zicboz

To test running a KVM guest with Zicboz this kvmtool branch may be used
https://github.com/jones-drew/kvmtool/commits/riscv/zicboz

Thanks,
drew

[1] I ported the functions under test to userspace and linked them with
    a test program. Then, I ran them under gdb with a script[3] which
    counted instructions by single stepping.
[2] https://lore.kernel.org/all/20221027130247.31634-1-ajones@ventanamicro.com/
[3] https://gist.github.com/jones-drew/487791c956ceca8c18adc2847eec9c60

v3:
  - CC'ed DT list
  - Improved commit message of DT bindings patch to point out relationship
    with cbom-block-size
  - Picked up an a-b from Conor

v2:
  - s/blksz/block_size/, improved commit message for "RISC-V: Add Zicboz
    detection and block size parsing", isa ext sorting [Conor]
  - Added dt binding patch [Heiko]
  - Picked up r-b's from Conor, Heiko, and Anup
  - Moved config symbol and CBO_zero() introduction to "RISC-V: Use Zicboz
    in clear_page when available" and improved its commit message and
    implementation (unrolled four times) [drew]
  - Dropped memset() patches [drew]
  - Rebased on ae4d39f75308 ("Merge patch "RISC-V: fix incorrect type of
    ARCH_CANAAN_K210_DTB_SOURCE"") plus the dependencies

Andrew Jones (6):
  RISC-V: Factor out body of riscv_init_cbom_blocksize loop
  dt-bindings: riscv: Document cboz-block-size
  RISC-V: Add Zicboz detection and block size parsing
  RISC-V: Use Zicboz in clear_page when available
  RISC-V: KVM: Provide UAPI for Zicboz block size
  RISC-V: KVM: Expose Zicboz to the guest

 .../devicetree/bindings/riscv/cpus.yaml       |  5 ++
 arch/riscv/Kconfig                            | 13 ++++
 arch/riscv/include/asm/cacheflush.h           |  3 +-
 arch/riscv/include/asm/hwcap.h                |  1 +
 arch/riscv/include/asm/insn-def.h             |  4 ++
 arch/riscv/include/asm/page.h                 |  6 +-
 arch/riscv/include/uapi/asm/kvm.h             |  2 +
 arch/riscv/kernel/cpu.c                       |  1 +
 arch/riscv/kernel/cpufeature.c                | 10 +++
 arch/riscv/kernel/setup.c                     |  2 +-
 arch/riscv/kvm/vcpu.c                         | 11 ++++
 arch/riscv/lib/Makefile                       |  1 +
 arch/riscv/lib/clear_page.S                   | 36 +++++++++++
 arch/riscv/mm/cacheflush.c                    | 64 +++++++++++--------
 14 files changed, 130 insertions(+), 29 deletions(-)
 create mode 100644 arch/riscv/lib/clear_page.S

Comments

Jeff Law Jan. 30, 2023, 6:30 p.m. UTC | #1
On 1/30/23 05:01, Andrew Jones wrote:
> When the Zicboz extension is available we can more rapidly zero naturally
> aligned Zicboz block sized chunks of memory. As pages are always page
> aligned and are larger than any Zicboz block size will be, then
> clear_page() appears to be a good candidate for the extension. While cycle
> count and energy consumption should also be considered, we can be pretty
> certain that implementing clear_page() with the Zicboz extension is a win
> by comparing the new dynamic instruction count with its current count[1].
> Doing so we see that the new count is just over a quarter of the old count
> (see patch4's commit message for more details).
> 
> For those of you who reviewed v1[2], you may be looking for the memset()
> patches. As pointed out in v1, and a couple follow-up emails, it's not
> clear that patching memset() is a win yet. When I get a chance to test
> on real hardware with a comprehensive benchmark collection then I can
> post the memset() patches separately (assuming the benchmarks show it's
> worthwhile).
So a note.  On the userspace side we are using cboz for clearing memory 
in memset.  While the data is intermixed with other changes, there's a 
very significant drop in stores and a host of related low level 
performance counters and a notable uptick in gcc #5 performance from 
spec2017 which is particularly sensitive to memory clearing.  We haven't 
seen any performance regressions attributable to using cboz across 
spec2017's integer suite.

I believe our current threshold setting is to use cboz for chunks >= 128 
bytes.

Jeff
Andrew Jones Jan. 30, 2023, 6:47 p.m. UTC | #2
On Mon, Jan 30, 2023 at 11:30:38AM -0700, Jeff Law wrote:
> 
> 
> On 1/30/23 05:01, Andrew Jones wrote:
> > When the Zicboz extension is available we can more rapidly zero naturally
> > aligned Zicboz block sized chunks of memory. As pages are always page
> > aligned and are larger than any Zicboz block size will be, then
> > clear_page() appears to be a good candidate for the extension. While cycle
> > count and energy consumption should also be considered, we can be pretty
> > certain that implementing clear_page() with the Zicboz extension is a win
> > by comparing the new dynamic instruction count with its current count[1].
> > Doing so we see that the new count is just over a quarter of the old count
> > (see patch4's commit message for more details).
> > 
> > For those of you who reviewed v1[2], you may be looking for the memset()
> > patches. As pointed out in v1, and a couple follow-up emails, it's not
> > clear that patching memset() is a win yet. When I get a chance to test
> > on real hardware with a comprehensive benchmark collection then I can
> > post the memset() patches separately (assuming the benchmarks show it's
> > worthwhile).
> So a note.  On the userspace side we are using cboz for clearing memory in
> memset.  While the data is intermixed with other changes, there's a very
> significant drop in stores and a host of related low level performance
> counters and a notable uptick in gcc #5 performance from spec2017 which is
> particularly sensitive to memory clearing.  We haven't seen any performance
> regressions attributable to using cboz across spec2017's integer suite.
> 
> I believe our current threshold setting is to use cboz for chunks >= 128
> bytes.

Thanks, Jeff! That's an encouraging report. I'll keep focused on
clear_page() with this series, but once I can get some numbers with
the memset patch, then I'll be happy to repost that as well.

Thanks,
drew
Jeff Law Jan. 30, 2023, 6:55 p.m. UTC | #3
[ Sorry for the duplicate.  Andrew indicated I'd used reply-list rather 
than reply-all.  ]


On 1/30/23 05:01, Andrew Jones wrote:
> When the Zicboz extension is available we can more rapidly zero naturally
> aligned Zicboz block sized chunks of memory. As pages are always page
> aligned and are larger than any Zicboz block size will be, then
> clear_page() appears to be a good candidate for the extension. While cycle
> count and energy consumption should also be considered, we can be pretty
> certain that implementing clear_page() with the Zicboz extension is a win
> by comparing the new dynamic instruction count with its current count[1].
> Doing so we see that the new count is just over a quarter of the old count
> (see patch4's commit message for more details).
> 
> For those of you who reviewed v1[2], you may be looking for the memset()
> patches. As pointed out in v1, and a couple follow-up emails, it's not
> clear that patching memset() is a win yet. When I get a chance to test
> on real hardware with a comprehensive benchmark collection then I can
> post the memset() patches separately (assuming the benchmarks show it's
> worthwhile).
> 
So a note.  On the userspace side we are using cboz for clearing memory 
in memset.  While the data is intermixed with other changes, there's a 
very significant drop in stores and a host of related low level 
performance counters and a notable uptick in gcc #5 performance from 
spec2017 which is particularly sensitive to memory clearing.  We haven't 
seen any performance regressions attributable to using cboz across 
spec2017's integer suite.

I believe our current threshold setting is to use cboz for chunks >= 128 
bytes.

Jeff