mbox series

[V13,-,RESEND,00/10] arm64/perf: Enable branch stack sampling

Message ID 20230711082455.215983-1-anshuman.khandual@arm.com (mailing list archive)
Headers show
Series arm64/perf: Enable branch stack sampling | expand

Message

Anshuman Khandual July 11, 2023, 8:24 a.m. UTC
This series enables perf branch stack sampling support on arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All
relevant register definitions could be accessed here.

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on 6.5-rc1.

Changes in V13:

https://lore.kernel.org/all/20230622065351.1092893-1-anshuman.khandual@arm.com/

- Added branch callback stubs for aarch32 pmuv3 based platforms
- Updated the comments for capture_brbe_regset()
- Deleted the comments in __read_brbe_regset()
- Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Changes in V12:

https://lore.kernel.org/all/20230615133239.442736-1-anshuman.khandual@arm.com/

- Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
- Replaced branch types with complete INSN/ALIGN prefixes/suffixes
- Replaced return branch types as simple RET/ERET
- Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
- Added 0 padding for BRBIDR0_EL1.NUMREC enum values
- Dropped helper arm_pmu_branch_stack_supported()
- Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
- Separated perf_task_ctx_cache setup from arm_pmu private allocation
- Collected changes to branch_records_alloc() in a single patch [5/10]
- Reworked and cleaned up branch_records_alloc()
- Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
- Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
- Updated the comment in branch_type_to_brbcr()
- Fixed the comment before stitch_stored_live_entries()
- Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
- Factored out helper __read_brbe_regset() from capture_brbe_regset()
- Dropped the helper copy_brbe_regset()
- Simplified stitch_stored_live_entries() with memcpy(), memmove()
- Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
- Rework brbe_attributes_probe() without 'struct brbe_hw_attr'
- Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
- Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
- Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead

Changes in V11:

https://lore.kernel.org/all/20230531040428.501523-1-anshuman.khandual@arm.com/

- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data

Changes in V10:

https://lore.kernel.org/all/20230517022410.722287-1-anshuman.khandual@arm.com/

- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()

Changes in V9:

https://lore.kernel.org/all/20230315051444.1683170-1-anshuman.khandual@arm.com/

- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1 
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism

Changes in V8:

https://lore.kernel.org/all/20230123125956.1350336-1-anshuman.khandual@arm.com/

- Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element 
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()   
- Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call

Changes in V7:

https://lore.kernel.org/all/20230105031039.207972-1-anshuman.khandual@arm.com/

- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganize PERF_SAMPLE_BRANCH_NO_[CYCLES|NO_FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
  select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()

Changes in V6:

https://lore.kernel.org/linux-arm-kernel/20221208084402.863310-1-anshuman.khandual@arm.com/

- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformated armv8pmu_branch_valid() warnings while rejecting unsupported events

Changes in V5:

https://lore.kernel.org/linux-arm-kernel/20221107062514.2851047-1-anshuman.khandual@arm.com/

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/20221017055713.451092-1-anshuman.khandual@arm.com/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/20220929075857.158358-1-anshuman.khandual@arm.com/

- Moved brbe_stack from the stack and now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branche records are now captured
  only if the event has PERF_SAMPLE_BRANCH_KERNEL which would already
  been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/20220908051046.465307-1-anshuman.khandual@arm.com/

- Dropped branch sample filter helpers consolidation patch from this series 
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/20220613100119.684673-1-anshuman.khandual@arm.com/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/20220412115455.293119-1-anshuman.khandual@arm.com/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/1642998653-21377-1-git-send-email-anshuman.khandual@arm.com/

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (10):
  drivers: perf: arm_pmu: Add new sched_task() callback
  arm64/perf: Add BRBE registers and fields
  arm64/perf: Add branch stack support in struct arm_pmu
  arm64/perf: Add branch stack support in struct pmu_hw_events
  arm64/perf: Add branch stack support in ARMV8 PMU
  arm64/perf: Enable branch stack events via FEAT_BRBE
  arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
  arm64/perf: Add struct brbe_regset helper functions
  arm64/perf: Implement branch records save on task sched out
  arm64/perf: Implement branch records save on PMU IRQ

 arch/arm/include/asm/arm_pmuv3.h    |  12 +
 arch/arm64/include/asm/perf_event.h |  46 ++
 arch/arm64/include/asm/sysreg.h     | 103 ++++
 arch/arm64/tools/sysreg             | 158 ++++++
 drivers/perf/Kconfig                |  11 +
 drivers/perf/Makefile               |   1 +
 drivers/perf/arm_brbe.c             | 716 ++++++++++++++++++++++++++++
 drivers/perf/arm_brbe.h             | 270 +++++++++++
 drivers/perf/arm_pmu.c              |  12 +-
 drivers/perf/arm_pmuv3.c            | 110 ++++-
 include/linux/perf/arm_pmu.h        |  19 +-
 11 files changed, 1431 insertions(+), 27 deletions(-)
 create mode 100644 drivers/perf/arm_brbe.c
 create mode 100644 drivers/perf/arm_brbe.h

Comments

Will Deacon July 31, 2023, 1:05 p.m. UTC | #1
Hi Anshuman,

On Tue, Jul 11, 2023 at 01:54:45PM +0530, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> relevant register definitions could be accessed here.
> 
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> 
> This series applies on 6.5-rc1.
> 
> Changes in V13:

I had a go at reviewing this series and, aside from the macro issue I've
already pointed out, I really struggled with the way that you've put the
series together:

  - You incrementally introduce dead code, forcing the reviewer to keep
    previous patches in their head awaiting for a caller to come along
    later.

    Example: Patch 4 literally just adds a new struct to the kernel.

  - You change arch/arm/, where this driver shouldn't even be _compiled_
    despite adding CONFIG_ARM64_BRBE.

    Example: Patch 5 adds some BRBE stubs to
             arch/arm/include/asm/arm_pmuv3.h

  - You undo/rework code that was introduced earlier in the series

    Example: armv8pmu_branch_read() is introduced as a useless stub in
             patch 5, rewritten in patch 6 and then rewritten again in
	     patch 10. Why should I waste time reviewing three versions
	     of this function?

  - You make unrelated cosmetic changes to the existing code inside
    patches adding new features.

    Example: Patch 5 randomly removes some comments from the existing
             code.

  - The commit messages are, at best, useless and err more on the side
    of nonsensical.

    Example: Look at patch 3:

    | This updates 'struct arm_pmu' for branch stack sampling support being added
    | later. This adds an element 'reg_trbidr' to capture BRBE attribute details.
    | These updates here will help in tracking any branch stack sampling support.
    |
    | This also enables perf branch stack sampling event on all 'struct arm pmu',
    | supporting the feature but after removing the current gate that blocks such
    | events unconditionally in armpmu_event_init(). Instead a quick probe can be
    | initiated via arm_pmu->has_branch_stack to ascertain the support.

    If I remove everything that isn't just describing the code, I'm left with:

    - 'reg_trbidr' captures BRBE attribute details
    - These updates here will help in tracking any branch stack sampling support.
    - perf branch stack sampling event is now enabled when it is supported
    - Probing is quick

    But crucial information is missing:

    * What is BRBE?
    * What is a BRBE attribute?
    * How are the details of an attribute captured?
    * How do these "updates" (which ones?) help in tracking branch stack sampling?
    * What is being tracked and why?
    * How quick is the probing and why do we care?
    * What is the perf branch stack sampling event and what does it mean
      to enable it? Does it offer something useful to the user?
    * Why do we want any of this?

(these examples are not intended to be an exhaustive list of things that
need fixing)

Overall, this makes the code needlessly difficult to review. However, I
don't reckon it's too much effort on your side to fix the things above.
You've been doing this for long enough (on the author and reviewer side)
that I hope you see what I'm getting at. If not, try reviewing your own
patches right before you hit 'git send-email'; I pretty much always find
a problem with my own code that way.

So, please, can you post a v14 which:

  1. Fixes the broken register access macros
  2. Adds some meaningful tests at the end of the series
  3. Squashes the new driver code (i.e. at least everything in
     arm_brbe.c and possibly just everything under drivers/perf/) down
     into a single patch
  4. Does any _necessary_ cleanup or refactoring at the start of the
     series, leaving out cosmetic stuff for now
  5. Rewrites the commit messages following the guidelines in
     submitting-patches.rst. You don't need to talk about specific C
     expressions; we have the code for that already and if it's doing
     something subtle then you can add a comment.
  6. Resolves the open CYCLES_COUNT issue from Yang and Suzuki

Cheers,

Will
Anshuman Khandual Aug. 18, 2023, 3:12 a.m. UTC | #2
On 7/31/23 18:35, Will Deacon wrote:
> Hi Anshuman,

Hello Will,

My apologies for the delayed response on this particular thread.

> 
> On Tue, Jul 11, 2023 at 01:54:45PM +0530, Anshuman Khandual wrote:
>> This series enables perf branch stack sampling support on arm64 platform
>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>> relevant register definitions could be accessed here.
>>
>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>
>> This series applies on 6.5-rc1.
>>
>> Changes in V13:
> 
> I had a go at reviewing this series and, aside from the macro issue I've
> already pointed out, I really struggled with the way that you've put the
> series together:
> 
>   - You incrementally introduce dead code, forcing the reviewer to keep
>     previous patches in their head awaiting for a caller to come along
>     later.

Another way of looking at this would be to prepare existing infrastructure
ready for an upcoming implementation which requires them later. This helps
avoid doing all the changes together in a single code block/patch, making
the entire function addition a giant monolith ?

Don't we allow patch series to build on functions and structures, which are
dead at the beginning, before enabling the entire function with a new config
selectable right at the end ? The only difference here is, these are changes
to various structures without their corresponding utilization some where in
the patch but the real question here would be - is this way of organizing the
patch series not allowed as per the kernel coding standards or guidelines ?

Regardless, if you prefer the suggested style for the series, I could switch
to it.

> 
>     Example: Patch 4 literally just adds a new struct to the kernel.

Which again follows the same principle as explained above, and Mark had acked
that patch earlier. I had assumed that validated the principle about prior
infrastructure building based patch series organization.

> 
>   - You change arch/arm/, where this driver shouldn't even be _compiled_
>     despite adding CONFIG_ARM64_BRBE.

These stubs are necessary to protect PMU drivers built on arm32 which share
basic branch record processing abstraction at ARMV8 PMU level. It compiles
successfully both on arm32 and arm64 platforms with these changes. Subject
line for this patch clearly mentions that as well.

arm64/perf: Add branch stack support in ARMV8 PMU

This is adding abstraction for branch stack support, not the implementation.
This actual implementation will come in the driver and gets enabled via the
new config CONFIG_ARM64_BRBE.

> 
>     Example: Patch 5 adds some BRBE stubs to
>              arch/arm/include/asm/arm_pmuv3.h

Those are not BRBE stubs. Instead those are branch records processing stubs
which is a higher abstraction layer at ARMV8 PMU level, before getting down
into arm64 specific branch records implementation via FEAT_BRBE enabled in
a subsequent driver patch.

armv8_pmu driver is shared for arm32 and arm64. The patch is adding generic
branch record stubs for both architectures, ensuring the armv8_pmu continues
to be built successfully, and finally adds the arm64 implementation with
FEAT_BRBE in the following patches.

> 
>   - You undo/rework code that was introduced earlier in the series
> 
>     Example: armv8pmu_branch_read() is introduced as a useless stub in
>              patch 5, rewritten in patch 6 and then rewritten again in
> 	     patch 10. Why should I waste time reviewing three versions
> 	     of this function?

[PATCH 5/10] adds callback for the branch records processing abstraction at
ARMV8 PMU level, without an implementation any where. The ARMV8 PMUV3 based
arm64 real implementation gets added later on via the BRBE driver, enabled
with new CONFIG_ARM64_BRBE.

armv8pmu_branch_read() is present at all these places.

1. arch/arm/include/asm/arm_pmuv3.h:	Stub to protect the arm32 build
2. arch/arm64/include/asm/perf_event.h:	Default stub without CONFIG_ARM64_BRBE
3. arch/arm64/include/asm/perf_event.h:	Declaration with CONFIG_ARM64_BRBE
4. drivers/perf/arm_brbe.c:		Definition with CONFIG_ARM64_BRBE
5. drivers/perf/arm_pmuv3.c:		Actual call site during PMU IRQ

armv8pmu_branch_read()

- Gets added in [PATCH 5] as branch records stub without an implementation
- Gets defined in [PATCH 6] in the BRBE driver (via CONFIG_ARM64_BRBE)
- Gets re-defined in [PATCH 10] where the new stitching gets implemented

Patch 6 added basic BRBE functionality. But patch 10 adds the necessary logic
to stitch cached records to provide the maximum records possible without loss
of continuity. I kept it separate to keep the complexity away. But I understand
we could fold it in to single patch 6.

> 
>   - You make unrelated cosmetic changes to the existing code inside
>     patches adding new features.
> 
>     Example: Patch 5 randomly removes some comments from the existing
>              code.

Mark had suggested to drop those comments earlier. I could still collect these
unrelated comments removal, on a pre-requisite patch, if that is better. Just
thought them to be too minor changes for a patch in itself.

https://lore.kernel.org/all/Y8AZXQJUO6h5mlgq@FVFF77S0Q05N/

Update: I have posted this cosmetic change as a separate patch which has been
picked up for next merge window.

https://lore.kernel.org/all/20230802090853.1190391-1-anshuman.khandual@arm.com/

> 
>   - The commit messages are, at best, useless and err more on the side
>     of nonsensical.
> 
>     Example: Look at patch 3:
> 
>     | This updates 'struct arm_pmu' for branch stack sampling support being added
>     | later. This adds an element 'reg_trbidr' to capture BRBE attribute details.
>     | These updates here will help in tracking any branch stack sampling support.
>     |
>     | This also enables perf branch stack sampling event on all 'struct arm pmu',
>     | supporting the feature but after removing the current gate that blocks such
>     | events unconditionally in armpmu_event_init(). Instead a quick probe can be
>     | initiated via arm_pmu->has_branch_stack to ascertain the support.
> 
>     If I remove everything that isn't just describing the code, I'm left with:
> 
>     - 'reg_trbidr' captures BRBE attribute details
>     - These updates here will help in tracking any branch stack sampling support.
>     - perf branch stack sampling event is now enabled when it is supported
>     - Probing is quick

I understand your concern here but this series has done many roundabouts, during
the course of its development including multiple rewrites for the same function.
Current 'reg_trbidr' construct was just added during V12. These high churns kept
on adding new problems each time. I will try, and update the commit message here
as suggested.

> 
>     But crucial information is missing:
> 
>     * What is BRBE?
>     * What is a BRBE attribute?

Just wondering - should this be part of cover-letter or in patch commit message ?

>     * How are the details of an attribute captured?
>     * How do these "updates" (which ones?) help in tracking branch stack sampling?
>     * What is being tracked and why?
>     * How quick is the probing and why do we care?
>     * What is the perf branch stack sampling event and what does it mean
>       to enable it? Does it offer something useful to the user?
>     * Why do we want any of this?

I will certainly be more careful in keeping these questions in mind while rewriting
the commit messages.

> 
> (these examples are not intended to be an exhaustive list of things that
> need fixing)

There has been a very small set of folks reviewing and testing the series, hence
the assumption that every one knows about the context was slowly slipping into
the commit messages as well. But I will keep this in mind while re-writing the
commit messages next time around.

>> Overall, this makes the code needlessly difficult to review. However, I
> don't reckon it's too much effort on your side to fix the things above.

Sure, will get on fixing these things as suggested.

> You've been doing this for long enough (on the author and reviewer side)
> that I hope you see what I'm getting at. If not, try reviewing your own
> patches right before you hit 'git send-email'; I pretty much always find
> a problem with my own code that way.

Thanks, I will try and incorporate your valuable suggestions going forward.

> 
> So, please, can you post a v14 which:
> 
>   1. Fixes the broken register access macros>

Already fixed.

> >   2. Adds some meaningful tests at the end of the series

We already have some branch stack sampling tests for perf subsystem.

tools/perf/tests/shell/test_brstack.sh

James Clark is also trying to get some BRBE specific tests added there.

https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32

Are you looking for some thing more ? Any particular suggestions ?

>   3. Squashes the new driver code (i.e. at least everything in
>      arm_brbe.c and possibly just everything under drivers/perf/) down
>      into a single patch

Please find the code change stats in drivers/perf/ directory for your reference.

 drivers/perf/arm_brbe.c             | 716 ++++++++++++++++++++++++++++
 drivers/perf/arm_brbe.h             | 270 +++++++++++
 drivers/perf/arm_pmu.c              |  12 +-
 drivers/perf/arm_pmuv3.c            | 110 ++++-

Squashing drivers/perf/arm_brbe.c and drivers/perf/arm_brbe.h will blur these
functions and fold them back into the base driver addition itself.

- Base BRBE implementation enabling perf branch stack sampling support
- sched_task() saving off older branch records when task schedules out
- PMU IRQ branch records management accommodating saved older branch records

But sure, if that is something preferred, will try and fold them down into a
single patch. But will keep the higher level abstraction for branch records
processing at ARV8 PMU level (shared between arm32 and arm64) in a separate
patch before the BRBE based implementation.

>   4. Does any _necessary_ cleanup or refactoring at the start of the
>      series, leaving out cosmetic stuff for now

Sure, will send out a clean up patch dropping those comments first.

Update: Already sent that patch

https://lore.kernel.org/all/20230802090853.1190391-1-anshuman.khandual@arm.com/

>   5. Rewrites the commit messages following the guidelines in
>      submitting-patches.rst. You don't need to talk about specific C
>      expressions; we have the code for that already and if it's doing
>      something subtle then you can add a comment.

Sure, will try and improve the commit message as suggested.

>   6. Resolves the open CYCLES_COUNT issue from Yang and Suzuki

Already fixed.

In order to summerize the patch series re-organization again, V14 will contain
the patches in order.

1.  drivers: perf: arm_pmu: Add new sched_task() callback
2.  arm64/perf: Add BRBE registers and fields
3.  arm64/perf: Add branch stack support in struct arm_pmu
4.  arm64/perf: Add branch stack support in struct pmu_hw_events
5.  arm64/perf: Add branch stack support in ARMV8 PMU
6.  arm64/perf: Enable branch stack events via FEAT_BRBE
 
I am just wondering if both PATCH 3 and PATCH 4 should be folded into PATCH 5
which adds the perf branch stack sampling abstraction at ARMV8 PMU level.

- Anshuman
Will Deacon Aug. 18, 2023, 5:56 p.m. UTC | #3
On Fri, Aug 18, 2023 at 08:42:56AM +0530, Anshuman Khandual wrote:
> On 7/31/23 18:35, Will Deacon wrote:
> > I had a go at reviewing this series and, aside from the macro issue I've
> > already pointed out, I really struggled with the way that you've put the
> > series together:
> > 
> >   - You incrementally introduce dead code, forcing the reviewer to keep
> >     previous patches in their head awaiting for a caller to come along
> >     later.
> 
> My apologies for the delayed response on this particular thread.

It's ok, I'm in no rush to merge it.

> Another way of looking at this would be to prepare existing infrastructure
> ready for an upcoming implementation which requires them later. This helps
> avoid doing all the changes together in a single code block/patch, making
> the entire function addition a giant monolith ?

I'm not after another way of looking at it, I'm telling you that the way
you have structured this is needlessly difficult to review. Yes, we often
split large patches into incremental changes, but that doesn't mean that
breaking a driver up into pieces of dead code and dummy type definitions
is useful. Look at how I structured the original SPE driver for example:

https://lore.kernel.org/all/1507811438-2267-1-git-send-email-will.deacon@arm.com/

The driver code itself lives entirely in patch 7, with the earlier patches
adding self-contained core functionality on which the driver depends. I
didn't add an empty 'struct arm_spe_pmu' did I? This isn't rocket science.

> Don't we allow patch series to build on functions and structures, which are
> dead at the beginning, before enabling the entire function with a new config
> selectable right at the end ? The only difference here is, these are changes
> to various structures without their corresponding utilization some where in
> the patch but the real question here would be - is this way of organizing the
> patch series not allowed as per the kernel coding standards or guidelines ?
> 
> Regardless, if you prefer the suggested style for the series, I could switch
> to it.

Please don't imply that I'm being difficult with an unusual reviewer
preference for how this should be structured. Ask any of your colleagues in
the Arm kernel team and I suspect you'll hear similar concerns to mine.
As I said before, try putting yourself in the shoes of the reviewer and
you'll soon see how needlessly difficult you're making things.

> >     Example: Patch 4 literally just adds a new struct to the kernel.
> 
> Which again follows the same principle as explained above, and Mark had acked
> that patch earlier. I had assumed that validated the principle about prior
> infrastructure building based patch series organization.

I don't think an ack validates any principles about anything.

> >   - You change arch/arm/, where this driver shouldn't even be _compiled_
> >     despite adding CONFIG_ARM64_BRBE.
> 
> These stubs are necessary to protect PMU drivers built on arm32 which share
> basic branch record processing abstraction at ARMV8 PMU level. It compiles
> successfully both on arm32 and arm64 platforms with these changes. Subject
> line for this patch clearly mentions that as well.

But they shouldn't be needed. When CONFIG_ARM64_BRBE is not selected, no
branch record processing functions should be needed, empty or otherwise.
The callers should not exist in the first place (i.e. the empty definitions
should live in the core driver code, not in the arch header, or the calling
code should not be compiled at all).

> arm64/perf: Add branch stack support in ARMV8 PMU
> 
> This is adding abstraction for branch stack support, not the implementation.
> This actual implementation will come in the driver and gets enabled via the
> new config CONFIG_ARM64_BRBE.
> 
> > 
> >     Example: Patch 5 adds some BRBE stubs to
> >              arch/arm/include/asm/arm_pmuv3.h
> 
> Those are not BRBE stubs. 

Oh, come off it. You literally have a useless comment:

/* BRBE stubs */

which is apparently even more useless than I thought. Now, can we stop
wasting time arguing about which level of abstraction you think these
are at and start focussing on removing them, please?

> Instead those are branch records processing stubs
> which is a higher abstraction layer at ARMV8 PMU level, before getting down
> into arm64 specific branch records implementation via FEAT_BRBE enabled in
> a subsequent driver patch.
> 
> armv8_pmu driver is shared for arm32 and arm64. The patch is adding generic
> branch record stubs for both architectures, ensuring the armv8_pmu continues
> to be built successfully, and finally adds the arm64 implementation with
> FEAT_BRBE in the following patches.
> 
> > 
> >   - You undo/rework code that was introduced earlier in the series
> > 
> >     Example: armv8pmu_branch_read() is introduced as a useless stub in
> >              patch 5, rewritten in patch 6 and then rewritten again in
> > 	     patch 10. Why should I waste time reviewing three versions
> > 	     of this function?
> 
> [PATCH 5/10] adds callback for the branch records processing abstraction at
> ARMV8 PMU level, without an implementation any where. The ARMV8 PMUV3 based
> arm64 real implementation gets added later on via the BRBE driver, enabled
> with new CONFIG_ARM64_BRBE.
> 
> armv8pmu_branch_read() is present at all these places.
> 
> 1. arch/arm/include/asm/arm_pmuv3.h:	Stub to protect the arm32 build
> 2. arch/arm64/include/asm/perf_event.h:	Default stub without CONFIG_ARM64_BRBE
> 3. arch/arm64/include/asm/perf_event.h:	Declaration with CONFIG_ARM64_BRBE
> 4. drivers/perf/arm_brbe.c:		Definition with CONFIG_ARM64_BRBE
> 5. drivers/perf/arm_pmuv3.c:		Actual call site during PMU IRQ
> 
> armv8pmu_branch_read()
> 
> - Gets added in [PATCH 5] as branch records stub without an implementation
> - Gets defined in [PATCH 6] in the BRBE driver (via CONFIG_ARM64_BRBE)
> - Gets re-defined in [PATCH 10] where the new stitching gets implemented
> 
> Patch 6 added basic BRBE functionality. But patch 10 adds the necessary logic
> to stitch cached records to provide the maximum records possible without loss
> of continuity. I kept it separate to keep the complexity away. But I understand
> we could fold it in to single patch 6.

I don't know what you're trying to say here, but you seem to be reiterating
my point that you're adding armv8pmu_branch_read() three times during the
patch series.

Cutting to the main part of your email:

> > So, please, can you post a v14 which:
> > 
> >   1. Fixes the broken register access macros>
> 
> Already fixed.

Thank you, and what a horrifying assembler/compiler divergence that turned
out to be!

> > >   2. Adds some meaningful tests at the end of the series
> 
> We already have some branch stack sampling tests for perf subsystem.
> 
> tools/perf/tests/shell/test_brstack.sh
> 
> James Clark is also trying to get some BRBE specific tests added there.
> 
> https://gitlab.arm.com/linux-arm/linux-jc/-/commit/3a7ddce70c2daadb63fcc511de0a89055ca48b32
> 
> Are you looking for some thing more ? Any particular suggestions ?

That looks great, but afaict it's 8 months old and hasn't been posted to the
mailing list. Did I miss it? Why isn't it part of this series?

> >   3. Squashes the new driver code (i.e. at least everything in
> >      arm_brbe.c and possibly just everything under drivers/perf/) down
> >      into a single patch
> 
> Please find the code change stats in drivers/perf/ directory for your reference.
> 
>  drivers/perf/arm_brbe.c             | 716 ++++++++++++++++++++++++++++
>  drivers/perf/arm_brbe.h             | 270 +++++++++++
>  drivers/perf/arm_pmu.c              |  12 +-
>  drivers/perf/arm_pmuv3.c            | 110 ++++-
> 
> Squashing drivers/perf/arm_brbe.c and drivers/perf/arm_brbe.h will blur these
> functions and fold them back into the base driver addition itself.

It's hard to know what to do with a diffstat, so I'll wait for v14 (but
please note that this is too late for 6.6 now).

> - Base BRBE implementation enabling perf branch stack sampling support
> - sched_task() saving off older branch records when task schedules out
> - PMU IRQ branch records management accommodating saved older branch records

Why would we want a driver that doesn't handle IRQs or context switching? I
don't think you gain anything by splitting the driver like this.

> In order to summerize the patch series re-organization again, V14 will contain
> the patches in order.
> 
> 1.  drivers: perf: arm_pmu: Add new sched_task() callback
> 2.  arm64/perf: Add BRBE registers and fields
> 3.  arm64/perf: Add branch stack support in struct arm_pmu
> 4.  arm64/perf: Add branch stack support in struct pmu_hw_events
> 5.  arm64/perf: Add branch stack support in ARMV8 PMU
> 6.  arm64/perf: Enable branch stack events via FEAT_BRBE
>  
> I am just wondering if both PATCH 3 and PATCH 4 should be folded into PATCH 5
> which adds the perf branch stack sampling abstraction at ARMV8 PMU level.

I can't tell from just the subjects, but patches 3-5 (and possibly 6)
sound like they should be the same patch?

But again, subject lines and diffstats aren't especially helpful.

Will
Anshuman Khandual Aug. 21, 2023, 8:53 a.m. UTC | #4
Hello Will,

Separated this part out just to understand it better.

On 8/18/23 23:26, Will Deacon wrote:
>> These stubs are necessary to protect PMU drivers built on arm32 which share
>> basic branch record processing abstraction at ARMV8 PMU level. It compiles
>> successfully both on arm32 and arm64 platforms with these changes. Subject
>> line for this patch clearly mentions that as well.

> But they shouldn't be needed. When CONFIG_ARM64_BRBE is not selected, no
> branch record processing functions should be needed, empty or otherwise.

But that is not how the code is organized currently. CONFIG_ARM64_BRBE enables
the driver to bring in BRBE HW specific branch stack callback implementation
details. But these callbacks are always present at ARMV8 PMU level even without
BRBE implementation as well. Hence these call sites are present, regardless of
CONFIG_ARM64_BRBE.

> The callers should not exist in the first place (i.e. the empty definitions

But how to achieve that ? Branch stack needs to be driven along side normal PMU
events, which in turn get driven from 'struct arm_pmu' callbacks. Hence branch
stack callbacks too needs to be at ARMV8 PMU level, and closely tied to normal
PMU event handling callbacks. Let's examine from where all these new callbacks
are called from.

* armv8pmu_disable_event()	---> armv8pmu_branch_disable()
* armv8pmu_handle_irq()		---> armv8pmu_branch_read()
* armv8pmu_sched_task()		---> armv8pmu_branch_save()
* armv8pmu_sched_task()		---> armv8pmu_branch_reset()
* armv8pmu_reset()		---> armv8pmu_branch_reset()
* __armv8_pmuv3_map_event()	---> armv8pmu_branch_attr_valid()
* __armv8pmu_probe_pmu()	---> armv8pmu_branch_probe()
* armv8pmu_probe_pmu()		---> armv8pmu_task_ctx_cache_alloc()
* armv8pmu_probe_pmu()		---> branch_records_alloc()

Please note that, branch_records_alloc() being deferred allocation is only one
that is platform agnostic.

I did not intend to make any of the BRBE details visible at ARMV8 PMU level i.e
right inside armv8pmu_XXXX() definitions, as ARMV8 PMU is shared between arm32
and arm64 platforms.

Hence these new branch stack callbacks are added along with required fallback
stubs for build protection, so that the HW implementations can be hidden inside
a new driver wrapped in CONFIG_ARM64_BRBE. Please note that these new branch
stack functions are not 'struct arm_pmu' callbacks but instead normal helpers.

> should live in the core driver code, not in the arch header, or the calling

Driver code is compiled with CONFIG_ARM64_BRBE, hence the real definitions are
there. Instead default stubs are required only when armv8pmu_branch_XXX() call
backs are not defined. But are you suggesting that these stubs be moved inside
drivers/perf/arm_pmuv3.c (where all call sites reside), so that there will be
just one stub copy both for arm32 and arm64 platforms removing their duplicate
definitions from arch headers i.e arch/arm64/include/asm/perf_event.h and
arch/arm/include/asm/arm_pmuv3.h ?

> code should not be compiled at all).

Branch stack sampling is always enabled from core perf perspective without any
config option to wrap around, hence calling code cannot be selectively enabled
or disabled.

- Anshuman
Anshuman Khandual Sept. 27, 2023, 8:37 a.m. UTC | #5
On 7/11/23 13:54, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> relevant register definitions could be accessed here.
> 
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> 
> This series applies on 6.5-rc1.
> 
> Changes in V13:
> 
> https://lore.kernel.org/all/20230622065351.1092893-1-anshuman.khandual@arm.com/
> 
> - Added branch callback stubs for aarch32 pmuv3 based platforms
> - Updated the comments for capture_brbe_regset()
> - Deleted the comments in __read_brbe_regset()
> - Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
> - Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()

Hello All,

Upcoming V14 series is still being worked out, and might take some more time. But
meanwhile please find its corresponding development branch for experimentation
purpose, if required. Thank you.

https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v14_rc3)

- Anshuman